Thursday, October 30, 2014

CrashPlan: Don't forget the umlaut

Good thing that I tested a full restore from my new CrashPlan backup, as I found that something was missing: every file whose name contained non-ASCII characters had been omitted from the backup!

It turns out that Java is to blame -- at least in part.  Filenames are, after all, strings, and Java treats them as such; any filename returned by a system call (as an array of bytes) is decoded into a String object (a sequence of UTF-16 code units) based on the character encoding of the current locale.  The same goes in reverse: any String filename passed as an argument to a system call is encoded back.  If all goes well, the two operations are exact opposites, and cancel each other out; the string we give to open(2) should be byte-for-byte identical to the one we got from readdir(3).

If, however, the filename is not properly encoded according to the current locale, it may contain byte sequences which are invalid, and cannot be converted into code points.  (This is typically the case with ISO-8859-1 filenames under a UTF-8 locale.)  In that case, the Unicode replacement character (U+FFFD) is used instead -- that's what it's for, after all.  Consequently, the re-encoded filename will not be identical to the original, and will refer to a (most probably) nonexistent file with a weird name.  (The effects can be perplexing at first, such as listed files not really existing.)
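
To make the failure concrete, here is a minimal sketch of that round trip.  It is written in Python rather than Java, on the assumption that errors="replace" is a fair stand-in for the substitution described above; the filename and the round_trip helper are made up for the illustration.

```python
# Illustration only: Python stands in for the JVM's filename handling.
# The sample filename and the round_trip() helper are hypothetical.

def round_trip(raw: bytes, codec: str = "utf-8") -> bytes:
    """Decode a raw filename to text, then encode it back to bytes."""
    text = raw.decode(codec, errors="replace")   # invalid bytes become U+FFFD
    return text.encode(codec, errors="replace")

utf8_name = "caf\u00e9.txt".encode("utf-8")         # b'caf\xc3\xa9.txt'
latin1_name = "caf\u00e9.txt".encode("iso-8859-1")  # b'caf\xe9.txt'

# A properly encoded name survives the round trip unchanged.
print(round_trip(utf8_name) == utf8_name)

# The ISO-8859-1 name does not: a lone 0xE9 byte is not valid UTF-8, so it
# is replaced, and the re-encoded name refers to a different file.
print(round_trip(latin1_name) == latin1_name)
print(round_trip(latin1_name))
```

The last line shows the name that would actually be handed back to the system: the single byte 0xE9 has turned into the three-byte UTF-8 encoding of U+FFFD.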

If the C locale is in effect (typically because $LANG -- or $LC_ALL or $LC_CTYPE -- was explicitly set to "C", or left undefined, either of which can often be the case for init scripts, or when using sudo), then only ASCII characters are allowed; any filename with non-ASCII characters (be it encoded with UTF-8 or ISO-8859-1) will definitely not work.
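
The same experiment can be repeated with an ASCII-only codeset.  Again, this is only a sketch in Python: the "ascii" codec is assumed to approximate what the C locale allows, and the exact substitute bytes a given JVM produces may differ.

```python
# Illustration only: the "ascii" codec stands in for the C locale's codeset.
# The sample filename is hypothetical.

utf8_name = "caf\u00e9.txt".encode("utf-8")   # b'caf\xc3\xa9.txt', valid UTF-8

# Under an ASCII-only codeset, every byte >= 0x80 is invalid, so even a
# correctly encoded UTF-8 name is mangled on the way in...
decoded = utf8_name.decode("ascii", errors="replace")

# ...and cannot be restored on the way out, since U+FFFD is not ASCII either.
reencoded = decoded.encode("ascii", errors="replace")

print(repr(decoded))
print(reencoded)
print(reencoded == utf8_name)
```

Both bytes of the encoded "é" are lost, so no choice of filename encoding helps here; only the locale can be fixed.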

CrashPlan actually accounts for all of this, and makes sure to set $LANG to "en_US.UTF-8" if it was previously undefined.  (It also enforces UTF-8 as the current codeset.  If your filesystem is still using a legacy encoding, welcome to the 21st century.)  This ensures that UTF-8 filenames will be properly handled.  Assuming, of course, that en_US.UTF-8 is a valid locale.

That's the catch: on a Debian system, locales are not installed as-is, but rather generated on demand (to save space).  And it's quite possible for en_US.UTF-8 to not have been generated, if another UTF-8 locale is being used in its stead.  In that case, setting $LANG to en_US.UTF-8 names a locale which does not exist; the process silently falls back to the C locale, under which non-ASCII filenames cannot be handled properly.
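
Checking for this situation is not hard.  The following is a minimal sketch in Python (not anything CrashPlan does, as far as I can tell); the locale_available helper is a made-up name, and it relies on setlocale refusing a locale which has not been generated, as glibc does.

```python
import locale

def locale_available(name: str) -> bool:
    """Return True if the C library accepts `name` for LC_CTYPE."""
    saved = locale.setlocale(locale.LC_CTYPE)     # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)   # raises if unsupported
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # always restore

if not locale_available("en_US.UTF-8"):
    print("warning: locale en_US.UTF-8 is not available; "
          "non-ASCII filenames may be skipped")
```

A check along these lines at startup would be enough to turn a silent fallback into a visible warning.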

CrashPlan's fault in all this is quite simple: it does not appear to output any error or warning message in this situation.  Seems like a serious oversight to me.

Setting $LANG to the proper locale in bin/run.conf would do the trick, but according to Code42, this file will be overwritten when upgrading to a new version.  (And unlike that other bug which prevents the client from launching, this one could easily go unnoticed if reintroduced.)  It's probably best to play it safe, and just generate the damn US locale.  (On Debian, that means enabling en_US.UTF-8 in /etc/locale.gen and running locale-gen, or going through dpkg-reconfigure locales.)

Problem solved.
