Sorting words separated by commas

T. Kurt Bond

2021-07-16 13:05

I often have lists of "words", separated by commas, possibly on multiple lines, like this example from a Makefile:

#    bookman, schoolbook, palatino, times,
#    helvetica, helvetica-narrow, optima, cormorant-garamond,
#    or ebgaramond.

I find these lists are always getting out of order, or they end up with some short lines and some long lines. I want to be able to reformat them automatically, like this:

#    bookman, cormorant-garamond, ebgaramond, helvetica, helvetica-narrow,
#    optima, palatino, schoolbook, or times.

So, I wrote three scripts to deal with them: sort-with-commas to do the sorting, strip-leading-hash to get rid of the leading hashes and spaces, and prefix to put the leading hashes and spaces back.

Now, above I said "words", because really it's anything separated by commas, so the "words" can contain spaces, etc.

Also, notice that the period after "ebgaramond" and the "or" before "ebgaramond" in the original list disappear, and an "or " appears before the new end of the list, "times", and a period follows it. You can have the same situation with "and". So, the -p option to sort-with-commas adds a period after the last word, the -a option adds an "and " before the last word, and the -o option adds an "or " before the last word. If you are sorting only part of a list, you want to have a comma after the last "word", so there is the -f option for that. And to remove the period from the original list, so it doesn't end up in the middle of the new list, or to remove "and " or "or ", there is the -r option.
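To see why the -r cleanup matters, here is a minimal sketch that mimics only the splitting and sorting steps. The helper name split_words is made up for this illustration, and it uses [[:space:]] where the real script uses [ \t].

```shell
# A short list whose last item carries "or " and a trailing period.
list='times, helvetica,
or ebgaramond.'

# Hypothetical helper: one "word" per line, leading whitespace and
# empty lines removed.
split_words() {
    tr ',' '\n' | sed -e 's/^[[:space:]]*//' -e '/^$/d'
}

# Without cleanup, "or ebgaramond." is sorted as if it were a word,
# so it lands in the middle of the list.
without_cleanup=$(printf '%s\n' "$list" | split_words | sort)

# With cleanup (what -r is for), the "or " and the period are removed
# before sorting.
with_cleanup=$(printf '%s\n' "$list" | split_words |
    sed -E -e 's/^(and|or)[[:space:]]+//' -e 's/\.[[:space:]]*$//' |
    sort)

printf 'without cleanup:\n%s\n\nwith cleanup:\n%s\n' \
    "$without_cleanup" "$with_cleanup"
```

The -o and -p options then put the "or " and the period back on whichever word ends up last.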

The default is to return the sorted list as one long line, but you can easily reformat it to multiple lines by running it through the Unix command fmt.
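As a small illustration of the fmt step, the sketch below wraps a one-line list. The -w 40 width is chosen only to force several short lines in this example; the pipeline in this post uses plain fmt with its default width.

```shell
# A sorted list on one long line, as sort-with-commas would return it.
long_line='bookman, cormorant-garamond, ebgaramond, helvetica, helvetica-narrow, optima, palatino, schoolbook, or times.'

# Wrap it to lines of at most 40 characters.
wrapped=$(printf '%s\n' "$long_line" | fmt -w 40)
printf '%s\n' "$wrapped"
```

Exactly where the lines break can differ between fmt implementations, but the words and their order stay the same.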

In this case the list is prefixed with a "#" and some spaces because it comes from a comment in a Makefile, and you have to remove those to sort the list. I wrote the script strip-leading-hash to do that, rather than having to remember the sed command every time.

So, to sort the original list I'd run the command

strip-leading-hash | sort-with-commas -r -p -o | fmt | prefix "#    "

which means “strip the leading hashes and spaces, remove the trailing period and the "and " or "or ", add a final period after the last word, add an "or " before the final word, reformat as a paragraph, and prefix the lines with the hash and spaces.”
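The whole pipeline can be tried without installing anything by using shell functions as stand-ins for the three scripts. This is only a sketch: the function names (with underscores) are invented for the example, sort_with_commas_rpo is hard-coded to behave like the -r -p -o combination, and it uses [[:space:]] where the real scripts use [ \t].

```shell
# Stand-in for strip-leading-hash.
strip_leading_hash() { sed -e 's/^#[[:space:]]*//'; }

# Stand-in for prefix.
prefix() { sed -e "s/^/$1/"; }

# Stand-in for "sort-with-commas -r -p -o" (options hard-coded).
sort_with_commas_rpo() {
    tr ',' '\n' |
        sed -E -e 's/^[[:space:]]+//' -e '/^$/d' \
               -e 's/^(and|or)[[:space:]]+//' -e 's/\.[[:space:]]*$//' |
        sort -u |
        sed -e 's/$/,/' -e '$s/,$//' -e '$s/^/or /' -e '$s/$/./' |
        tr '\n' ' ' | sed -e 's/ $//'
}

input='#    bookman, schoolbook, palatino, times,
#    helvetica, helvetica-narrow, optima, cormorant-garamond,
#    or ebgaramond.'

output=$(printf '%s\n' "$input" |
    strip_leading_hash | sort_with_commas_rpo | fmt | prefix '#    ')
printf '%s\n' "$output"
```

However fmt happens to wrap it, each output line starts with the hash and spaces, and read together the lines give the list sorted from "bookman" through "schoolbook", ending with "or times."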

When I use this I'm usually in Emacs and using M-| to run it on the region (the currently selected text), often with a C-u prefix argument to replace the region with the results.

Here's the main script, sort-with-commas:

#! /usr/bin/env bash
###############################################################################
# Sort a list of words that are separated by commas, optionally followed by
# a newline, into a single line separated by commas followed by spaces.
#
# For example: it translates (ignore the "#" and spaces at the beginning of lines)
#    bookman, schoolbook,palatino,
#    times, helvetica, helvetica-narrow,
# to
#    bookman, helvetica, helvetica-narrow, palatino, schoolbook, times
###############################################################################

AND_OPT=off                     # Insert "and " before last word.
FINAL_OPT=off                   # Leave "," after last word.
OR_OPT=off                      # Insert "or " before last word.
PERIOD_OPT=off                  # Insert a final period after last word.
REMOVE_AND_OR_PERIOD_OPT=off    # Remove leading "and "/"or " and trailing period.


let errors=0
while getopts "?afhopr" opt
do
    case "$opt" in
        (\?|h) let errors++ ;;
        (a) AND_OPT=on ;;
        (f) FINAL_OPT=on ;;
        (o) OR_OPT=on ;;
        (p) PERIOD_OPT=on ;;
        (r) REMOVE_AND_OR_PERIOD_OPT=on ;;
    esac
done

shift $((OPTIND-1))

[[ $# -gt 0 || $errors -gt 0 ]] && {
    cat <<EOF
usage: sort-with-commas [OPTION]

This reads its standard input and sorts a line or multiple lines with
"words" separated by commas, then reassembles the line, words
separated by a comma and a space, optionally leaving a final comma
after the last word, or a period, and optionally putting "and " or
"or " before the last word.

Options

-? -h   This message.
-a      Insert "and " before last word.
-f      Leave final comma after last word.
-o      Insert "or " before last word.
-p      Insert a period after the last word.
-r      Remove "and " or "or " at the beginning of a "word", and a period at
        the end of a "word", in the original list.

Note that combining -a and -o, or -f and -p, does what you say, but the
results are silly.
EOF
    exit 1
}

tr ',' '\n' | sed -E -e 's/^[ \t]+//' -e '/^$/d' |
    (if [[ "$REMOVE_AND_OR_PERIOD_OPT" = "on" ]]; then
         sed -E -e 's/^(and|or)[ \t]+//' -e 's/\.[ \t]*$//'; else cat; fi) |
    sort -u |
    sed -E -e 's/$/,/' |
    (if [[ "$AND_OPT" = "on" ]]; then sed -e '$s/^/and /'; else cat; fi) |
    (if [[ "$FINAL_OPT" = "on" ]]; then cat; else sed -e '$s/,//'; fi) |
    (if [[ "$OR_OPT" = "on" ]]; then sed -e '$s/^/or /'; else cat; fi) |
    (if [[ "$PERIOD_OPT" = "on" ]]; then sed -e '$s/$/./'; else cat; fi) |
    tr '\n' ' ' | sed -E -e 's/[ ]$//'

Here's strip-leading-hash:

#! /usr/bin/env bash

sed -E -e 's/^#[ \t]*//'

And here's prefix:

#! /usr/bin/env bash

sed "s/^/$1/"
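One thing to be aware of: prefix splices $1 straight into the sed replacement text, so a prefix containing "/", "&", or a backslash will not be inserted literally. If that ever matters, one possible alternative is to let printf do the work. The sketch below is written as a shell function with the made-up name prefix_literal, not as a replacement for the script above.

```shell
# Hypothetical variant of prefix: prepend "$1" to every input line
# without giving any character in the prefix a special meaning.
prefix_literal() {
    # The "|| [ -n "$line" ]" part keeps a final line that lacks a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s%s\n' "$1" "$line"
    done
}

printf 'bookman, optima,\nor times.\n' | prefix_literal '// & '
```

For the "#    " prefix used in this post, the sed one-liner works fine as it is.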
