I believe that accented characters like é are rendered as a single glyph in fonts, but can be composed of two Unicode code points: the base character (e) and a combining character (´).
This is complicated by the fact that Unicode also supports the precomposed character as a separate single code point, for backwards compatibility with legacy character sets. However, a normalized form is recommended when you compare or store strings: NFD normalizes to the decomposed form, NFC to the composed form (NFC is the one usually recommended on the web).
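To make the two forms concrete, here is a short Python sketch (Python is used purely for illustration, since Arc has no normalization functions yet):

  import unicodedata

  composed   = "\u00e9"    # é as a single precomposed code point
  decomposed = "e\u0301"   # e followed by a combining acute accent

  print(composed == decomposed)                                # False
  print(unicodedata.normalize("NFD", composed) == decomposed)  # True
  print(unicodedata.normalize("NFC", decomposed) == composed)  # True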
You need some way to get at the numerical code-point value of a character in order to implement string library functions like case conversion. You don't need a separate char data type for that, though: the function could operate on a string and simply return the code point of the character at a given index as an int.
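Python, again purely as illustration, works exactly this way: there is no char type, indexing yields a one-character string, and ord gives you the code point as an int.

  s = "héllo"
  print(ord(s[1]))  # 233, the code point of é
  print(chr(233))   # 'é', back from the code point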
If links stop working after a while, I'm pretty sure people will stop linking. And even if they don't, visitors following those links will get a bad experience.
You may be right that it won't hurt PageRank directly, though.
I suppose you never have to use closures, since you can always rewrite the code to use defop. But a major selling point of Arc seems to be the conciseness of building page flows with macros like w/link. If that approach turns out not to be recommended for "real-world use", I think it defeats the purpose, and it would be fair to say that Arc itself fails the Arc challenge.
I'd much rather change the underlying implementation of w/link to be more robust, if possible.
(Btw., it is only in the context of links that I think long-lived closures are a problem. For responses to form posts I don't think there is a problem, since those are not bookmarkable or indexed anyway.)
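To illustrate why such links die, here is a rough Python sketch of the general mechanism behind w/link-style links (my own simplification, with made-up names like make_link and harvest, not Arc's actual code): the server stores each closure in a table under a generated id, the URL carries only the id, and old entries eventually get pruned.

  import secrets, time

  handlers = {}   # id -> (closure, creation time)
  TTL = 3600      # hypothetical lifetime: one hour

  def make_link(closure):
      fnid = secrets.token_urlsafe(8)
      handlers[fnid] = (closure, time.time())
      return "/x?fnid=" + fnid   # the URL carries only the id

  def handle(fnid):
      entry = handlers.get(fnid)
      if entry is None:
          return "Unknown or expired link."   # what a bookmarked link decays into
      return entry[0]()

  def harvest():
      # prune old closures; any bookmarked links to them break
      now = time.time()
      for fnid, (_, created) in list(handlers.items()):
          if now - created > TTL:
              del handlers[fnid]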
Cool. I don't think it should be an option though, since the server generates UTF-8 anyway; it just doesn't label it correctly. I can't imagine a case where it would be useful _not_ to indicate the encoding.
Not indicating the encoding leaves you vulnerable to an XSS attack. For instance, the following looks harmless, but if you don't set the encoding explicitly, it can get executed if the browser is set to UTF-7 or auto-detects the page as UTF-7:
+ADw-script+AD4-alert('XSS')+ADw-/script+AD4-
Edit to add some explanation: if displayed as UTF-7, the above will pop up an "XSS" alert box. It's just an example; it doesn't actually do anything harmful, but it shows the potential for malicious XSS. A key point is that HTML-escaping your output or filtering out HTML tags isn't enough, since innocuous-looking characters can cause problems if the encoding is misinterpreted.
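You can check what a UTF-7 view of the payload turns into by decoding it, e.g. with Python's built-in utf-7 codec:

  payload = b"+ADw-script+AD4-alert('XSS')+ADw-/script+AD4-"
  print(payload.decode("utf-7"))   # <script>alert('XSS')</script>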
Another option is to have all necessary information about the current operation in the URL. This is highly scalable, since you don't need to keep track of anything user-specific on the server(s), and the navigation supports branching and back/undo just like you describe.
Strangely enough, PG specifically disallows this approach in his competition!
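For illustration, here is a minimal Python sketch of the state-in-the-URL approach; the HMAC signature is my own addition (as are names like encode_state), to keep clients from tampering with the encoded state:

  import base64, hashlib, hmac, json

  SECRET = b"server-side secret"   # hypothetical signing key

  def encode_state(state):
      raw = json.dumps(state).encode()
      sig = hmac.new(SECRET, raw, hashlib.sha256).hexdigest()[:16]
      return base64.urlsafe_b64encode(raw).decode() + "." + sig

  def decode_state(token):
      data, sig = token.rsplit(".", 1)
      raw = base64.urlsafe_b64decode(data)
      if hmac.new(SECRET, raw, hashlib.sha256).hexdigest()[:16] != sig:
          raise ValueError("tampered state")
      return json.loads(raw)

  url = "/catalog?s=" + encode_state({"page": 3, "sort": "price"})

Since every request carries its own state, any server can answer it, and branching and the back button work for free.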
Most web apps need _both_ global session state and URL-based state. As others have pointed out, if you browse a product catalog, you would like to be able to branch into different browser windows or use the back button. However, when you add an item to the shopping basket, you want that to be a global state change (you want to have the same shopping basket in all windows), and you don't want a purchase to be undone by clicking back.
Continuations are only an option for handling URL-based state, not for handling global state. And for page state they have some limitations.
For example, if all navigation is handled by continuations, you basically have to store a continuation for every hit indefinitely, since you don't know whether the user has bookmarked the URL. If you don't want to store the continuations forever, you should only use them on pages that are not bookmarkable anyway, i.e. pages that are the response to form posts. But then the stated advantages, like the ability to branch and use the back button, are moot, since you cannot do that with form responses anyway.
Continuations are really nifty for quick prototypes of web apps, but for production use, I believe they are a leaky abstraction.
Unicode breaks in the hello-world web app. E.g., if you write
(defop hello req (pr "hello world \u1234"))
You get some strange-looking text in your browser. This seems to be because Arc is generating UTF-8 output (which I think is MzScheme's default) but not declaring the encoding, which makes most browsers default to interpreting it as ISO-8859-1.
It seems to be fixed by changing srv.arc line 105 so that the Content-Type header declares the charset, i.e. to something like:
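  Content-Type: text/html; charset=utf-8

With the charset declared, browsers stop guessing and render the UTF-8 output correctly.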
It is not really more work to make strings sequences of (at least) 24-bit values rather than sequences of 8-bit values. Actually it makes a lot of things simpler, since all strings can then be in the Unicode character set, rather than in a host of different and incompatible 8-bit character sets, which is the case in non-Unicode languages.
The difficulties languages like Python and Ruby have stem from backwards compatibility: a lot of existing code expects strings to be 8-bit byte arrays. Java and JavaScript got this more right by using 16-bit chars. That is still not enough for the full Unicode set (characters outside the Basic Multilingual Plane need surrogate pairs), but at least they don't have the problem of strings in multiple incompatible character sets.
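Python 3 (again purely as illustration) makes strings sequences of code points, which shows both points at once:

  s = "\U0001F600"                        # a character outside the 16-bit range
  print(len(s))                           # 1 code point
  print(len(s.encode("utf-8")))           # 4 bytes in UTF-8
  print(len(s.encode("utf-16-le")) // 2)  # 2 16-bit units: a surrogate pair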
It works because Arc uses the underlying MzScheme string implementation. Since this is incidental to the host and not part of the Arc spec (arc.arc), it is not guaranteed to keep working.
A patch to support Unicode (which PG has asked for) would have to include a "native" implementation of strings in Arc, which is a rather fundamental extension to the language, and I suspect the language designers would want to do that themselves. Or would you (the Arc language designers, if you read this) accept such a patch?