Flutterby™! : I18n

2007-03-01 21:21:25.621335+00 by Dan Lyke 5 comments

Okay all you computer language geeks out there... Internationalization (I18n) has just been dropped in my lap. It's not an immediate thing, but I should be educating myself and making suggestions. I've done enough stuff with XML that I've got a good grasp on UTF-8, but I need a little guidance:

  • Programming language symbols in non-ASCII characters (ie: variable names in Japanese or with accents): Good or bad idea?
  • Extended C++ string packages? Any good ones? Do we write our own? Is the standard library's C++ string interface suitable?
  • Internally, do we use variable width characters (ie: UTF-8), is UTF-16 a reasonable compromise, or do we still have to worry about variable width given that the full Unicode spec has 17 planes of 64k characters in it?
  • lex and yacc?
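On the variable-width question above: one property that makes UTF-8 livable is that continuation bytes are self-identifying (they all match the bit pattern 10xxxxxx), so counting or skipping code points never requires decoding. A minimal sketch of my own, not from any particular library:

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 encoded std::string.
// Every UTF-8 continuation byte matches 10xxxxxx, so a byte starts
// a new code point exactly when (b & 0xC0) != 0x80.
std::size_t utf8_length(const std::string &s) {
    std::size_t count = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

So a three-byte sequence like "\xE6\x97\xA5" (the character 日) counts as one code point even though std::string::size() reports three bytes.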

[ related topics: Software Engineering ]

comments in ascending chronological order:

#Comment Re: made: 2007-03-01 22:02:32.181905+00 by: meuon

Are you trying to CODE in UTF-8/UTF-16/other character sets, or just display things to users in their language? First, all your field lengths/variable sizes for everything just got too small. Getting Traditional Chinese to display properly on WinXP was a major pain, even just in a web browser, and I just got told we were going to be supporting some Korean at NK as well... I don't know how well Macs/OS X do it, but I have been really impressed with Linux's multi-character-set support, and PHP/MySQL seem to work well with other char sets. You are writing in a much less abstracted world. (Good Luck 吉)

#Comment Re: made: 2007-03-01 22:59:06.402403+00 by: Dan Lyke

I think we'll still be writing our code in ASCII, but we have a few issues:

First off, we have a language in which people describe their models. It's a programming language.

We have a console which displays text. Right now we're doing it one way in OpenGL, and while that way works fine for 127 characters, it won't scale to 17*65k, which means we'd need to switch to a different mechanism there. And right now we use a fixed-spacing font for that tool; are we going to have to revamp our system to handle proportional spacing because we'll have users doing Kanji?
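One common way around the "won't scale to 17*65k" problem is to stop pre-building a fixed glyph table and instead rasterize glyphs on demand, caching them by code point so the console only pays for characters it actually displays. A sketch of the idea; GlyphId and rasterize_glyph are hypothetical names standing in for whatever the real renderer (a texture handle, a FreeType call) would provide:

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical handle for a rendered glyph (e.g. an OpenGL texture id).
using GlyphId = int;

// Stub standing in for the real rasterizer; here it just echoes the
// code point so the sketch is self-contained and testable.
GlyphId rasterize_glyph(std::uint32_t cp) {
    return static_cast<GlyphId>(cp);
}

// Demand-loaded glyph cache: rasterize a glyph the first time its
// code point appears, then reuse it, instead of a fixed 128-entry table.
class GlyphCache {
public:
    GlyphId get(std::uint32_t cp) {
        auto it = cache_.find(cp);
        if (it == cache_.end())
            it = cache_.emplace(cp, rasterize_glyph(cp)).first;
        return it->second;
    }
private:
    std::unordered_map<std::uint32_t, GlyphId> cache_;
};
```

This sidesteps the scaling problem but not the proportional-spacing one: once glyphs come with their own advance widths, the layout code has to consume them.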

Localization (remapping English prompts and menus and such to other languages) is probably the least of our worries. But, yeah, that's on the table too.

#Comment Re: made: 2007-03-02 07:18:43.549688+00 by: spc476

On programming languages, think about people in Russia who have to code in C and don't necessarily speak English. I personally think it would be nice if they could use their native language to program (but don't ask me to maintain it). The real question I would think is: does the compiler support non-ASCII characters? If it does, then I don't see a real problem.

Since I don't program in C++, I can't help you with the C++ string packages, but again, it comes down to: do any of the existing ones handle Unicode properly? If not, you may have to write your own (or do a lot of fixing).

In playing with Unicode (and GNU's iconv package) I tend to use UCS-4 internally. Sure, it uses a ton of memory, but a lot of C idioms for walking through strings carry over without a lot of coding overhead. If you aren't working with large documents, you might be able to use UCS-4 internally but store everything as UTF-8.

As for filenames, that's really up to the underlying file system and how it handles UTF-8.

#Comment Re: made: 2007-03-02 19:26:10.492205+00 by: dexev

Do you want to allow your programmers to mix tabs and spaces? Indent blocks only by a prime number of spaces? Write functions with ALL_CAPS and #define mixedCaseConstants_maybe_with_underscores?

I think i18n'ing the code itself is a style decision like the above. I'm doing i18n in Python right now (which has built-in support for UTF-8), so I have no useful input on the other questions.

#Comment Re: made: 2007-03-02 20:45:38.575311+00 by: Dan Lyke

The question on support of symbols is one of revamping the language parser, and all the pros and cons I can come up with on that are good. Half of me thinks that'd be easy: just switch the character size to 32 bits, and surely someone's come up with classification functions like "isspace" and "isalnum" for Unicode. The other half of me realizes that I don't know what those functions are or where they'll come from yet, or whether our lex grammar is easily adjustable for wider characters (ie: does the lex-generated code use any 256-byte arrays that are indexed by the next character?).
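For what it's worth, such classification functions do exist: the C library has iswspace/iswalnum in <cwctype> (locale-dependent, and tied to the platform's wchar_t width), and ICU provides u_isspace and friends backed by the full Unicode property tables. The shape of a locale-independent version is simple enough to hand-roll over code points; here's a deliberately incomplete sketch of my own covering only a handful of whitespace characters, just to show that the classifier indexes by code point rather than by byte (which is exactly what breaks the 256-entry ctype tables lex leans on):

```cpp
#include <cstdint>

// Sketch of a Unicode-aware isspace over 32-bit code points.
// A real implementation would use the full Unicode White_Space
// property (or ICU); this covers only a few well-known cases.
bool is_unicode_space(std::uint32_t cp) {
    switch (cp) {
        case 0x0009: case 0x000A: case 0x000D: case 0x0020: // ASCII whitespace
        case 0x00A0:                                        // no-break space
        case 0x3000:                                        // ideographic space
            return true;
        default:
            return cp >= 0x2000 && cp <= 0x200A;            // en quad .. hair space
    }
}
```

The lex question is the harder half: classic lex/flex scanners are built around 8-bit character classes, so going to 32-bit characters typically means either re-expressing the grammar over UTF-8 byte sequences or moving to a Unicode-aware scanner generator.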

Filesystem support kind of drives some of these other issues, because we need to manage files in our C-like preprocessor (ie: #include "hiraganacharacters"), but that starts to say that we're going to have to manage them in the console too, because we'll need to print error messages that say "Unable to open 'hiraganacharacters'"...

Yeah, the reason I've avoided this so far is that I've either just been passing UTF-8 through (C coding on XML stuff) or Perl has managed it for me.