78

Is the expansion of a wildcard in Bash guaranteed to be in alphabetical order? I am forced to split a large file into 10 MB pieces so that they can be accepted by my Mercurial repository.

So I was thinking I could use:

split -b 10485760 Big.file BigFilePiece.

and then in place of:

cat Big.file | bigFileProcessor

I could do:

cat BigFilePiece.* | bigFileProcessor

in its place.

However, I could not find anything that guarantees that the expansion of the asterisk (aka wildcard, aka *) will always be in alphabetical order, so that .aa comes before .ab (as opposed to timestamp ordering or something like that).

Also, are there any flaws in my plan? How great is the performance cost of concatenating the pieces with cat?

Daniel Alder
Sled
  • 4
    For sure you are taking the wrong approach. If the admin put a limit on the size of files you can have in the repository, then you should talk with him. As for expansion, I have always seen it be alphanumerical. – Mircea Vutcovici Mar 15 '10 at 19:58
  • 1
    You can always pipe through `sort` if you need any additional order manipulation. – Warner Mar 15 '10 at 20:45
  • 2
    Please note that Mercurial can manage files of any size, limited by the amount of RAM you have. You get a warning if you add a big file, since Mercurial assumes that it can hold the file in memory. For merges, Mercurial needs to hold two files in memory. Machines with small amounts of RAM may therefore have trouble checking out the file. I just tested it, and `hg commit` on a `N` MB file requires about `3 * N` MB of RAM and `hg update` requires about `2 * N` MB of RAM. This is with Mercurial 1.5 on Linux. – Martin Geisler Mar 16 '10 at 09:16
  • 1
    @Warner `sort` sorts lines, globbing does not return lines thus `sort` does *not* work as is. – stefanct Feb 06 '21 at 17:04

3 Answers

97

Yes, globbing expansion is alphabetical.

From the Bash man page:

Pathname Expansion

After word splitting, unless the -f option has been set, bash scans each word for the characters *, ?, and [. If one of these characters appears, then the word is regarded as a pattern, and replaced with an alphabetically sorted list of file names matching the pattern.
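As a rough check of the plan in the question, the following sketch splits a small stand-in file and reassembles it through the glob. The test data, the 10 KB piece size, and the name `Rejoined.file` are assumptions made for illustration; only `Big.file` and the `BigFilePiece.` prefix come from the question.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Small stand-in for the real Big.file (hypothetical test data).
seq 1 20000 > Big.file

# Split into 10 KB pieces; split names them BigFilePiece.aa, .ab, ...
split -b 10240 Big.file BigFilePiece.

# The glob is expanded in sorted order, so cat sees the pieces in sequence.
cat BigFilePiece.* > Rejoined.file

# Compare the reassembled file with the original byte for byte.
if cmp -s Big.file Rejoined.file; then
    roundtrip=ok
else
    roundtrip=mismatch
fi
echo "round trip: $roundtrip"
```

The default suffixes from `split` are fixed-width lowercase letters, which is why the sorted expansion matches the order in which the pieces were written.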

Olaf Dietsche
Dennis Williamson
  • @Dennis Williamson, Any idea if this would still be true if a user has a different language set? – Zoredache Mar 15 '10 at 22:35
  • 10
    @Zoredache: It's actually specified by POSIX: http://opengroup.org/onlinepubs/007908775/xsh/glob.html "The pathnames are in sort order as defined by the current setting of the LC_COLLATE category, see the XBD specification, LC_COLLATE [http://opengroup.org/onlinepubs/007908775/xbd/locale.html#tag_005_003_002]" and it's why you should do things like `ls -l [[:lower:]]` instead of `ls -l [a-z]`. – Dennis Williamson Mar 16 '10 at 00:31
  • 1
    Note that the order is alphabetical so BigFilePiece.10 will come before BigFilePiece.2 – Ken Jul 24 '14 at 13:14
  • @DennisWilliamson - Why two pairs of square brackets? One seems to work exactly the same to me. – ArtOfWarfare Mar 14 '18 at 21:08
  • 2
    @ArtOfWarfare: Try this: `mkdir lctest; cd lctest; touch w; touch z; ls -l [:lower:]; echo =====; ls -l [[:lower:]]`. The "z" file is only listed by the second `ls` because it's asking for lower case single-letter filenames. The first `ls` - the one without the outer square brackets - is asking for single-character file names from the list of characters ":", "l", "o", "w", "e", and "r". In both cases the outermost square brackets delimit a bracket expression which lists characters and classes. In the case of `[[:lower:]]`, the inner square brackets, colons and word name a character class. ... – Dennis Williamson Mar 15 '18 at 17:06
  • ... [man 7 regex](https://linux.die.net/man/7/regex) – Dennis Williamson Mar 15 '18 at 17:06
  • @DennisWilliamson - Interesting. I could have sworn that `[:digit:]` was working yesterday, but now it seems like only `[[:digit:]]` is working. – ArtOfWarfare Mar 16 '18 at 19:35
  • Great to know. Using this to run multiple mySQL file imports sequentially. – Markus Zeller Feb 02 '21 at 10:30
6

It is documented behavior for bash so you can depend upon it in your scripts. It also has been true of other Bourne compatible shells for a very long time ... though there may be corner cases regarding case folding or non-alphanumeric characters.

(The resulting list, in bash will be in almost "ASCII-betical" order --- except that lower and upper case letters will be collated together as if there were no case differences but with lower case collated before their upper case equivalents. All non-alphabetics should collate into the same order as they appear in ASCII).

As others have pointed out, this could be perturbed by your language-related environment settings: LANG generally and LC_COLLATE more specifically. It might be safest to run commands that depend on glob expansion ordering under an env command to clear the environment (using -i or -u as appropriate) or to pipe the results through sort to ensure robust sequencing.
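One detail worth keeping in mind: the glob is expanded by the calling shell before the command (or `env`) starts, so the locale has to be changed in the shell that performs the expansion. A minimal sketch, assuming two hypothetical file names chosen only because their order depends on the collation rules:

```shell
# Scratch directory with two names whose relative order differs between
# the C locale and many UTF-8 locales.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch piece.a piece.B

# Set the locale inside a subshell: the shell doing the expansion sees it,
# and the caller's locale is left untouched.
c_order=$(
    LC_ALL=C
    export LC_ALL
    set -- piece.*
    echo "$1 $2"
)
echo "glob order under LC_ALL=C: $c_order"

# Alternative: print one name per line and sort bytewise.
# This assumes no file name contains a newline.
printf '%s\n' piece.* | LC_ALL=C sort
```

In the C locale, names are compared byte by byte, so uppercase letters sort before lowercase ones.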

Jim Dennis
  • 5
    It appears that all non-alphanumerics are *ignored* in the sorting process. So "=", "_", "~" cannot be used to force a file to start or end (respectively) the list. – Otheus Jan 20 '12 at 11:46
4

While glob expansions are sorted alphabetically, they also obey the shell's language setting.

Make sure to set this to "C" in your script if you intend this to be portable.

adaptr