Yes, I think UTF-8 would be a disaster with modifiable strings. mzscheme uses UCS-4 (UTF-32) internally, and that would be the simplest approach. If you are willing to ignore Unicode characters > 65536, then UCS-2 would be okay with half the memory usage. When you talk about a character represented by several code points, are you talking about Unicode surrogates for characters > 65536? (Oversimplifying, two UCS-2 surrogate characters are used to represent one Unicode code point > 65536.) I think you'd be better off with UTF-32 than UTF-16 and surrogates, as surrogates look like a nightmare that you'd only want if you need backwards compatibility, see Java's character support: http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character....
Oh, Unicode combining characters and normalization. I classify that as "somebody else's problem." Specifically, if you're writing a font rendering engine, it's your problem. If you're writing an Arc compiler, it's not your problem. If you want complete Unicode library support in your language (like MzScheme's normalization functions string-normalize-nfd, etc.), then you just use an existing library such as ICU, and it's not your problem. ICU: http://www-306.ibm.com/software/globalization/icu/index.jsp