Unicode in qooxdoo
Filed under: Technical, Tool Chain
By Thomas Herchenröder @ January 20, 2009 3:12 pm
In a recent post I wrote about I18N in qooxdoo. Another topic in that vein is Unicode handling.
Unicode is a universal standard for encoding the characters of natural languages from all over the world. In essence, it enumerates all those characters, assigning each one a number (its code point) from 0 up to 0x10FFFF, a little over 1.1 million. On computers, these numbers are then represented as bytes, again using one of several encodings. One of the most popular of these is UTF-8, a variable-length encoding in which one character (more precisely, its code point) is encoded in one to four octets (8-bit bytes). The encoding goes like this:
Scalar Value                 First Byte  Second Byte  Third Byte  Fourth Byte
--------------------------------------------------------------------------
00000000 0xxxxxxx            0xxxxxxx
00000yyy yyxxxxxx            110yyyyy    10xxxxxx
zzzzyyyy yyxxxxxx            1110zzzz    10yyyyyy     10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx   11110uuu    10uuzzzz     10yyyyyy    10xxxxxx
(source: unicode.org)
The table shows how bits are distributed from the original code point ("Scalar Value") to the bytes representing it. There are other encodings, such as the fixed-length UCS-2 and UCS-4 encodings. The nice thing about UTF-8 is that it is space-efficient for text that is mostly ASCII and, maybe more importantly, fully backwards compatible with ASCII: a single-byte UTF-8 value is identical to its ASCII value (7-bit ASCII, that is). For code points beyond that range, the high bits of each byte merely tag that byte (110, 1110 or 11110 on the first byte to signal the length of the sequence, 10 on every following byte), while the lower portion of the byte holds the actual code point bits.
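To make the bit distribution in the table concrete, here is a minimal sketch in Python (not qooxdoo code; the function name utf8_encode is made up for this illustration). It builds the bytes by hand, row by row, and can be checked against Python's built-in encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode scalar value as UTF-8, following the table.

    Hypothetical helper for illustration only; real code would simply
    call chr(code_point).encode("utf-8").
    """
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % code_point)

    if code_point < 0x80:
        # Row 1: 0xxxxxxx (identical to 7-bit ASCII)
        return bytes([code_point])
    if code_point < 0x800:
        # Row 2: 110yyyyy 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:
        # Row 3: 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # Row 4: 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


# One sample per table row: "A", "ö", "€", and an emoji beyond U+FFFF.
for sample in ("A", "\u00f6", "\u20ac", "\U0001F600"):
    encoded = utf8_encode(ord(sample))
    assert encoded == sample.encode("utf-8")
    print("U+%04X" % ord(sample), encoded.hex())
```

The 0x3F mask keeps the six payload bits of each following byte, and the 0x80 that is OR-ed in is the "10" tag from the table.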
qooxdoo treats all source code files and generated output, in fact all text files, as UTF-8. This ensures, among other things, that you can use all kinds of funny characters in string literals and comments in your code. Unicode characters in arguments to tr() are passed correctly to the corresponding .po files, and their Unicode translations are correctly retrieved at run time. Since browsers usually support UTF-8, they are even rendered nicely on screen.
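Why a single, consistent encoding convention matters can be seen in a small Python sketch (a generic illustration, not part of the qooxdoo tool chain; the variable names are made up here). The same bytes read back as UTF-8 round-trip cleanly, while reading them under a different assumption, here Latin-1, garbles every non-ASCII character.

```python
# A string literal with a non-ASCII character, as it might appear in source code.
name = "Herchenr\u00f6der"  # "Herchenröder"

# What ends up on disk when the file is written as UTF-8.
raw = name.encode("utf-8")

# Reading with the same convention restores the original text.
as_utf8 = raw.decode("utf-8")

# Reading with a different assumption (Latin-1) turns the two bytes of
# "ö" (0xC3 0xB6) into two separate, unrelated characters.
as_latin1 = raw.decode("latin-1")

print(as_utf8)    # Herchenröder
print(as_latin1)  # HerchenrÃ¶der
```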

Comment by Thomas Herchenröder
Joel Spolsky has a nice write-up on Unicode on his web site: http://www.joelonsoftware.com/articles/Unicode.html
February 24, 2009 2:51 pm