This is another post in a loose sequence of articles about I18N and Unicode support in qooxdoo, and in qooxdoo's tool chain in particular. (The others so far were 1 and 2).
Wesley Chun writes in his excellent Core Python Programming, in the section "Real-Life Lessons Learned", about adding Unicode support to a system after the fact:
"The retrofit of the entire system would be extremely tedious and time-consuming." (p.203, 2nd ed.)
and later adds
"Fixing Unicode bugs everywhere leads to code instability and the distinct possibility of introducing new bugs." (p.204, 2nd ed.)
You might not be surprised to hear that my experiences largely match his statements.
Using a unicode name space
This time, I started off by using a name space for a qooxdoo application that contained real unicode characters. To create the project, I did something like this:
create-application.py -n höräßlich
(Don't worry about the name space, it's a made-up word that contains three "funny" characters).
There are two main areas where the name space of an application makes its mark:
- it is inserted into various source files where the name space of the application is needed, e.g. in class files, in the first parameter to qx.Class.define().
- it is used in the file system, to create proper paths to the class files (as in .../source/class/<namespace>/....), to resource files, and so forth.
I had to tweak create-application.py a bit, but after that the resulting skeleton application looked just fine. Then I invoked generate.py source, and that's where the real trouble started...
The generator has to deal with exactly the two areas listed above. Text files are read and processed, and there is interaction with the file system, which always involves paths that contain the "funny" name space. I was confident that the text file handling would be consistent, since we read and write all text files as UTF-8. But there was no such consistency in file path handling, as became apparent while I chased and fixed exception after exception popping up in the generator.
Then I sat back to think about the issues from a general perspective. All critical code sections involved strings that now contained unicode characters which were not accounted for, either when doing operations on them, printing them, or passing them into module functions. The cure was almost always decoding the string from UTF-8. So I wondered whether there might be one or a few general approaches to the issue...
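The "decode from UTF-8" cure can be sketched as a small helper. This is only an illustration: the name to_unicode is made up here and is not a function from the qooxdoo tool chain, and the byte string stands in for a name space as it might arrive from a UTF-8 file system or command line.

```python
TEXT_TYPE = type(u"")  # unicode on Python 2, str on Python 3


def to_unicode(value, encoding="utf-8"):
    """Decode byte strings to text; pass text through unchanged.

    Hypothetical helper for illustration, not part of the tool chain.
    """
    if isinstance(value, TEXT_TYPE):
        return value
    return value.decode(encoding)


# The made-up name space from above, as UTF-8 encoded bytes.
raw_namespace = b"h\xc3\xb6r\xc3\xa4\xc3\x9flich"
namespace = to_unicode(raw_namespace)

# Twelve bytes on the wire, nine characters once decoded.
print(len(raw_namespace), len(namespace))
```

Decoding once, right where the string enters the program, keeps later operations (joining, comparing, formatting) free of mixed byte/unicode surprises.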
sys.setdefaultencoding()
The sys module of Python has a nice function, setdefaultencoding(), which lets you switch the default encoding from ASCII to something else, such as UTF-8. The Python runtime then uses this encoding whenever it has to decode a string buffer to unicode implicitly. And I'll tell you what, I switched to UTF-8 and it worked like a charm! But there is a catch: you can't use this function in your application code. Seriously. It is only available at Python start-up time, when Python scans the environment. It has to be called in a system-wide module, like /usr/local/lib/python2.5/site-packages/sitecustomize.py, where sitecustomize.py is a pre-defined name (you cannot make it up). After this module has been evaluated, the setdefaultencoding() function is removed from the sys module. There is no way of using it afterwards. Isn't that weird?! So I had to go on and look at the various places where we deal with strings in the tool chain.
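A quick way to see the removal from application code is to ask the sys module whether the function is still there. The sketch below assumes an interpreter that was started the usual way, with the site module loaded; the helper name can_switch_default_encoding is made up for this example.

```python
import sys


def can_switch_default_encoding():
    """True only while sys.setdefaultencoding() is still present."""
    return hasattr(sys, "setdefaultencoding")


current_default = sys.getdefaultencoding()
switchable = can_switch_default_encoding()

# By the time application code runs, the function is expected to be gone.
print("default encoding: %s, switchable: %s" % (current_default, switchable))
```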
String Literals
One aspect of the issue is string literals. So what about changing all string literals in all tool chain modules to Unicode strings?! That looked like a safe path to ensure we are treating strings as unicode everywhere. It turns out that the tool chain in total uses more than 13,000 string literals (gasp!). That was, um, more than I expected. Given that all those strings have to be represented in memory at run time, there seems to be some opportunity to improve the memory footprint here. But that's a different issue.
But concerning unicode support, whatever memory footprint our string literals have, converting them all to unicode strings would at least double it. Currently, we only have ASCII string literals that use one byte per character. Turning these into unicode would result, at best, in each character being represented by two bytes in memory, since Python normally uses UCS-2 internally to represent unicode (wide builds use UCS-4, i.e. four bytes per character). I didn't need any more estimates to conclude that this would be going too far.
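The doubling can be checked with a little arithmetic on a single literal. This sketch only counts the character payload and ignores per-object overhead; the sample string is just an example, and UTF-16-LE is used as a stand-in for UCS-2, which is byte-for-byte the same for characters like these.

```python
sample = u"qx.Class.define"

# One byte per character as an ASCII byte string ...
as_byte_string = len(sample.encode("ascii"))

# ... and two bytes per character in a UCS-2 style representation.
as_ucs2 = len(sample.encode("utf-16-le"))

print(as_byte_string, as_ucs2)
```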
So we're back to handling and converting strings on an individual and per-case basis.
The os module
Another aspect of unicode treatment is the Python modules we use. The os module is all over the place in the tool chain. The important discovery here was that the functions returning strings can often be steered in a helpful way through their arguments. Take os.listdir(): if called with a normal string it will return a list of normal strings. If called with a unicode argument, it will return unicode strings. (Try calling os.listdir() with a normal string representing a directory that contains unicode entries, either files or subdirectories. On Linux the returned list will contain those entries in UTF-8. On Windows it actually looks like UCS-2 encoding, the same encoding Python uses internally. But now try to decode a UCS-2 encoded string into an internal unicode string, and you'll be surprised.)
This sort of polymorphic behaviour holds for similar functions like os.walk() and os.path.join(). os.getcwd() can be replaced by os.getcwdu() in order to obtain the current working directory as a unicode object. So I will eventually crawl over the code base and look at each line where the os module is called, rectifying parameters or switching to unicode-enhanced functions.
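The os.listdir() behaviour is easy to try out on a scratch directory. The sketch below lists the same directory once with a byte string path and once with a unicode path and compares the types that come back. The helper listdir_both_ways is made up for this example, and the entry name is kept ASCII so that the example does not depend on the file system encoding of the machine it runs on.

```python
import os
import shutil
import sys
import tempfile

TEXT_TYPE = type(u"")  # unicode on Python 2, str on Python 3


def listdir_both_ways(directory):
    """Return (byte_entries, text_entries) for the same directory."""
    fs_enc = sys.getfilesystemencoding() or "utf-8"
    if isinstance(directory, TEXT_TYPE):
        as_text, as_bytes = directory, directory.encode(fs_enc)
    else:
        as_text, as_bytes = directory.decode(fs_enc), directory
    # The type of the argument decides the type of the results.
    return os.listdir(as_bytes), os.listdir(as_text)


tmp = tempfile.mkdtemp()
try:
    open(os.path.join(tmp, "entry.txt"), "w").close()
    byte_entries, text_entries = listdir_both_ways(tmp)
finally:
    shutil.rmtree(tmp)

print(type(byte_entries[0]), type(text_entries[0]))
```

Passing unicode in from the start means the results need no separate decoding step afterwards.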
... to be continued
And that's not the end of the story. Uses of other modules have to be inspected and corrected, and calls to their functions may have to be flanked by unicode decoding and encoding steps. While file I/O should be fine, terminal I/O might not be, so calls to print and <stream>.write() will have to be inspected. Platform dependencies might come into play. And then there is the question of how to obtain unicode command line parameters...
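For the terminal side, one possible shape of such an encoding step is sketched below. It is not what the tool chain does today; the helper encode_for_stream and the AsciiOnlyStream class are invented for this example. The idea is to encode explicitly for whatever encoding the stream reports, and to replace characters it cannot represent instead of raising an exception.

```python
import sys


def encode_for_stream(text, stream=None, fallback="utf-8"):
    """Encode text for a stream, replacing characters it cannot show."""
    stream = stream if stream is not None else sys.stdout
    # Some streams (pipes, redirected output) report no encoding at all.
    encoding = getattr(stream, "encoding", None) or fallback
    return text.encode(encoding, "replace")


class AsciiOnlyStream(object):
    """Hypothetical stand-in for a terminal that only handles ASCII."""
    encoding = "ascii"


encoded = encode_for_stream(u"h\xf6r\xe4\xdflich", AsciiOnlyStream())
print(encoded)
```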