Arc Forumnew | comments | leaders | submitlogin
2 points by kens 6107 days ago | link | parent

Yes, I think UTF-8 would be a disaster with modifiable strings. mzscheme uses UCS-4 (UTF-32) internally, and that would be the simplest approach. If you are willing to ignore Unicode characters > 65536, then UCS-2 would be okay with half the memory usage. When you talk about a character represented by several code points, are you talking about Unicode surrogates for characters > 65536? (Oversimplifying, two UCS-2 surrogate characters are used to represent one Unicode code point > 65536.) I think you'd be better off with UTF-32 than UTF-16 and surrogates, as surrogates look like a nightmare that you'd only want if you need backwards compatibility, see Java's character support: http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character....


1 point by almkglor 6107 days ago | link

> When you talk about a character represented by several code points, are you talking about Unicode surrogates for characters > 65536?

Actually I'm talking about so-called "combining characters" http://en.wikipedia.org/wiki/Combining_character

Normalization... hahahaha unicode unicode headaches headaches! http://en.wikipedia.org/wiki/Unicode_normalization

-----

4 points by kens2 6107 days ago | link

Oh, Unicode combining characters and normalization. I classify that as "somebody else's problem." Specifically, if you're writing a font rendering engine, it's your problem. If you're writing an Arc compiler, it's not your problem. If you want complete Unicode library support in your language (like MzScheme's normalization functions string-normalize-nfd, etc.), then you just use an existing library such as ICU, and it's not your problem. ICU: http://www-306.ibm.com/software/globalization/icu/index.jsp

-----