I'm no expert, but with Unicode strings, doesn't the "character" abstraction break down anyway, when a "character" sometimes needs to have multiple glyphs? With "characters" removed, wouldn't a multi-glyph "character" just be a length 1+ string also? Or am I just talking out of my arse here?
If I'm right, there are several distinct concepts behind Unicode:
A Unicode string is a sequence of codepoints (or characters).
A codepoint (or character) is a numeric id that can be represented in many ways (UTF-32: always 4 bytes; UTF-8: a single byte for codepoints < 128, 2 to 4 bytes for codepoints >= 128; ...).
A glyph is what you display on the screen: a Unicode string is displayed as a sequence of glyphs. A glyph can be a single character, or the combination of 2 or more characters.
e.g., 10 glyphs on the screen can be represented as 11 characters (the last two being composed into a single glyph). Depending on the underlying "physical" encoding, these 11 characters can occupy 44 bytes (UTF-32, with O(1) access to substrings) or, say, 25 bytes (UTF-8, with O(n) access to substrings), or ...
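The codepoint/glyph/byte distinction is easy to see in any language with real Unicode strings. A quick illustration in Python (not Arc): one on-screen glyph built from two codepoints, and its different byte lengths under different encodings.

```python
# "café" written with a combining accent: the final glyph "é" is two
# codepoints, "e" (U+0065) plus the combining acute accent (U+0301).
s = "caf" + "e\u0301"

print(len(s))                      # 5 codepoints, though 4 glyphs on screen
print(len(s.encode("utf-8")))      # 6 bytes: c,a,f,e = 1 byte each, U+0301 = 2
print(len(s.encode("utf-32-le")))  # 20 bytes: always 4 bytes per codepoint
```

So "length" already means three different things (glyphs, codepoints, bytes) before any language design decision is made.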
In a few words: characters have a meaning in Unicode, but they don't map well to bytes (or even to the physical representation) and, sometimes, not even to the way things are rendered on the screen.
As far as I know a "glyph" has a one-to-one mapping to a character, where "glyph" means the on-screen symbol used to represent the character. (I'm not sure whether there are multi-glyph single characters, although I do think there are characters which, when they appear in certain sequences, end up being displayed as one glyph, even though they are logically separate characters.)
Or do you really mean "octet" or byte, of which several are regularly used to represent a single character during a unicode transmission? In such a case.... define "string". Is a "string" a sequence of bytes, or a sequence of characters?
I believe that e.g. accented characters like é are implemented as a single glyph in fonts, but are composed of two unicode code points: the base character (e) and a modifier character (´).
This is complicated by the fact that Unicode also supports the combined character as a separate single code point, for backwards compatibility with legacy character sets. However, the decomposed (normalized) form is the recommended one.
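Both forms really do exist side by side; Python's standard `unicodedata` module makes the equivalence (and non-identity) easy to demonstrate:

```python
import unicodedata

decomposed = "e\u0301"    # "e" + combining acute accent (two codepoints)
precomposed = "\u00e9"    # legacy single codepoint "é"

# Same glyph on screen, different codepoint sequences:
print(decomposed == precomposed)                                # False

# Normalization converts between the two canonical forms:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Which is why any "are these two strings equal?" question in Unicode has to pick a normalization form first.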
> Or do you really mean "octet" or byte, of which several are regularly used to represent a single character during a unicode transmission? In such a case.... define "string". Is a "string" a sequence of bytes, or a sequence of characters?
I know I shouldn't have dipped my ignorant toe into Unicode waters :-)
Maybe a better question would be: If Arc got rid of the character datatype by collapsing strings-and-characters into strings-and-substrings, could you leave "how to represent a string" (chars vs. octets vs. bytes vs. code points vs. glyphs) out of the language spec altogether? Or would such a "clean" string abstraction conflict with having Unicode support (since Unicode is deeply encoding-specific)?
Very nice work on supporting both opt and key args.
Although if you had something like this:
(defun fn (a &o (b 'b) (c 'c) &k (d 'd))
with a usage like this:
(fn 1 'd 'e)
... how would you know whether:
1) 'd is the value of the first opt arg and 'e is the value of the second (the key arg unsupplied)
or
2) 'd is the key for the key arg, and 'e is its supplied value (the 2 opt args unsupplied)
?
Maybe I'm missing something here, but it seems to me that unless you have special syntax for keywords, you will get into trouble.
And if you have to introduce special syntax for keys anyway, it is just as well to make every single argument keyable on its symbol (even vanilla ones), and just worry about combining &o and &rest (which should then be doable).
"with a usage like this: (fn 1 'd 'e) how would you know..."
The interpretation is that d and e are the two optional args, so any caller wanting to supply a keyword arg has to supply the optionals. Recall that I said it was possible, not sensible. :) But in tool design I think we should let users hang themselves rather than guess (perhaps wrongly) that no one would ever come up with a good use for such a thing.
A second, lesser motivation is that CL works that way.
c.l.lisp just offered a much better observation: the optional args above are standard for the various "read" functions, and the start and end keywords are standard for string functions. Read-from-string then is inheriting consistently from both families.
> Having all parameters be keyword arguments as well might be interesting, but it wouldn't avoid optional/keyword confusion.
Why not?
Let's say you have a function with 3 standard args followed by 2 opt args. So you have 5 args, all keyable on the symbol you give them in the def.
Let's further say that in a call to this function, I key the middle standard arg (#2 in the def) plus the first opt arg (#4 in the def), and I omit the second opt arg (#5 in the def). So I'm supplying two standard args apart from the two args I'm keying. The function call parser would then know, after identifying the 2 keyed args and matching them to positions #2 and #4 in the def, that the first non-key arg supplied corresponds to position #1 in the def, that the second non-key arg supplied corresponds to position #3 in the def, and that an arg for position #5 in the def is missing, leading to the use of the default value for this opt arg.
This would even work when you want to raise an error for a non-supplied, non-opt arg.
Wouldn't this work quite intuitively (keying an arg when calling a function "lifts it out" of the normal "vanillas then optionals" argument sequence, shortening that sequence, but keeping a well-defined order for it)? (You would need special syntax for keys in this proposal. My suggestion is a colon appended to the arg symbol, rather than prepended, as in CL.)
Can someone give a counterexample if they think this somehow wouldn't work?
&rest args are left as an exercise for the reader :-)
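The "lifting out" scheme described above can be sketched concretely. Here is a toy version in Python (not Arc); the function name and representation are hypothetical, just to show that the matching is well-defined:

```python
# Sketch of the proposed call parser: keyed arguments claim their
# parameter slots first, then the positional arguments fill the
# remaining slots in definition order, then defaults fill the rest.
def match_args(params, defaults, pos_args, key_args):
    """params: parameter names in definition order.
    defaults: defaults for the optional (trailing) params.
    pos_args: positional values supplied by the caller.
    key_args: dict of keyed values supplied by the caller."""
    bound = dict(key_args)                 # keyed args claim their slots
    free = [p for p in params if p not in bound]
    if len(pos_args) > len(free):
        raise TypeError("too many arguments")
    bound.update(zip(free, pos_args))      # positionals fill what's left
    for p in params:
        if p not in bound:
            if p in defaults:
                bound[p] = defaults[p]     # optional arg: use default
            else:
                raise TypeError("missing required argument: " + p)
    return [bound[p] for p in params]
```

With the 5-arg example above (keying #2 and #4, supplying #1 and #3 positionally, omitting #5): `match_args(["a","b","c","d","e"], {"d":4,"e":5}, [1,3], {"b":2,"d":40})` yields `[1, 2, 3, 40, 5]`, exactly as the proposal predicts.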
I would rather have immutable strings + unification of symbols and strings.
- Any string could have an associated value, like symbols today.
- "foo", 'foo and (quote foo) would be the same object (you would allow Lisp-style prepend-quoting of non-whitespace strings for convenience).
- foo, (unquote "foo") or (unquote 'foo) would then be used to evaluate, so even non-whitespace strings like "bar baz" could act as symbols (but with less convenience, of course, since you would have to use the unquote form to get them evaluated).
- Since such a unified string/symbol would also act as a perfectly good key/value pair, a simple list of such strings will in effect act as a string-keyed hashtable (since duplicate strings in the list would be the same immutable key), and can be used wherever you need symbol tables (e.g. for lexical environments). In fact, string-keyed hash tables would be a subtype of any-sort-of-key hashtables, and probably used much more.
And by (unquote "foo"), do you mean (eval "foo")? Or do you mean `,"foo"? The latter makes more sense here.
At any rate, I'm not convinced that this is actually a benefit. Strings and symbols are logically distinct things. Strings are used when you want to know what they say, symbols when you want to know what they are. Unifying them doesn't seem to add anything, and you lose mutability (which, though unfunctional, can be quite useful).
> Strings and symbols are logically distinct things. Strings are used when you want to know what they say, symbols when you want to know what they are.
Fine. But this breaks down anyway when you force people to use (immutable) symbols instead of strings for efficient allocation. When using symbols as keys in hashtables, you do not "want to know what they are", you "want to know what they say".
And unification would possibly have good consequences for simplifying macros and code-as-data (especially if characters are also just strings of length 1). Most code fragments would then literally be strings (most notable exceptions would be numeric literals, list literals and the like).
Actually, in a hash table, I usually don't care what the key says, any more than I care about the name of the variable used to store an integer. I care about it for code readability, but I'm usually not concerned about getting a rogue key (where I do care what it says). In that case, I would either use string keys or (coerce input 'sym).
I'm not convinced that characters being strings of length one is a good idea... it seems like the "character" is actually a useful concept. But I don't have a huge opinion about this.
Code fragments would still be lists, actually: most code involves at least one function application, and that's a list structure. Only the degenerate case of 'var would be a string.
> Actually, in a hash table, I usually don't care what the key says, any more than I care about the name of the variable used to store an integer.
That's fine again, but my point is just that by using symbols as keys in hashtables, you never care about the value part of that symbol (you just need an immutable key); you're not using the symbol "as intended", for value storage.
> most code involves at least one function application, and that's a list structure.
Yep. But in the case where that function application does not contain another function application (or list literal) in any of its argument positions, we would, with my proposal, be talking about a list of strings, which could then again be seen as a string-keyed hash table...
Symbols are not "intended" for value storage, symbols happen to be used for value storage. Symbols have exactly the same properties as, say, named constants in C, and thus fit in the same niche. They also have the properties of variable names, and so fit in that niche too. Symbols are a generally useful datatype, and they are no more intended for just "value storage" than cons cells are intended for just list construction.
A list of strings is still a list, which is not what you said; right now, it's a list of symbols, and I don't see the benefit of a list of strings over a list of symbols.
It should be possible to detect, at define-time, whether a function/macro has side-effects (see the ideas for a conservative purity analyzer in http://arclanguage.org/item?id=558).
So if the function/macro has side-effects, def/mac could e.g. also assign it to a second symbol with "!" appended.
'?' could also be done similarly, but would be more difficult. Here is a start:
def/mac should also assign its function/macro to a second symbol with "?" appended if:
1) The function/macro has no side-effects (no "!")
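To make "conservative" concrete: the analyzer only has to prove absence of known side-effecting forms, and may say "impure" for anything it can't prove. A toy version in Python over code represented as nested lists (the set of impure operator names is my assumption, for illustration):

```python
# Conservative purity check: True only if no known side-effecting
# operator appears anywhere in the expression tree.  Anything not
# recognized as impure is assumed pure, and callers can pass in extra
# names of functions already known to be impure.
IMPURE_OPS = {"=", "set", "push", "pop", "pr", "prn", "sref"}

def pure(expr, impure_fns=frozenset()):
    if not isinstance(expr, list) or not expr:
        return True                      # atoms carry no effects
    head = expr[0]
    if head in IMPURE_OPS or head in impure_fns:
        return False
    return all(pure(e, impure_fns) for e in expr)
```

For example `pure(["+", "x", ["*", "y", 2]])` is true, while `pure(["do", ["=", "x", 1], "x"])` is false, so def/mac would add the "!" alias only for the latter.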
Leaving aside the issues of detecting a query, I think it's a really bad idea to have the compiler force (badly guessed) semantic knowledge on me.
It's my code, I wrote it, and I don't want any compiler telling me that a query that caches previous results must be imperative! and not end in a ? .
I also think needlessly polluting a namespace with additional symbols is a bad idea. You only get one namespace in Arc; try to preserve it for the future.
I think he was trying to make the point that the word "power", when used about programming languages, is either:
1) Well-defined in terms of computability, but then rather uninteresting, since all programming languages are Turing-complete.
or:
2) So ridiculously ill-defined as to be meaningless.
Instead of arguing about which language is the more "powerful", one should be arguing in terms of which sets of problems have efficient (in terms of programmer time and maybe learning curves) solutions.
A CL-style keyword argument is like an optional argument in that you need to give it a default value for when it's omitted. But in addition, with a keyword argument at function-call-time, you can switch argument order if you specify the key.
This means that Arc could treat keyword arguments as a special case of optional arguments.
This is the optional-arg example from the Arc tutorial:
arc> (def greet (name (o punc (case name who #\? #\!)))
(string "hello " name punc))
* redefining greet
#<procedure: greet>
arc> (greet 'who)
"hello who?"
But let's say you allow both "o" and "k" for optional arguments, and when there is a "k", you allow the argument symbol to be used as a key when calling the function, like this:
arc> (def greet (name (k punc (case name who #\? #\!)))
(string "hello " name punc))
* redefining greet
#<procedure: greet>
arc> (greet 'who)
"hello who?"
arc> (greet 'john)
"hello john!"
arc> (greet :punc "?!?" 'jane)
"hello jane?!?"
Of course, once you allow an argument symbol to be used as an argument key when calling, you could actually extend this opportunity to every single argument, optional or not. Then you would no longer need the "k" syntax, since every argument is "keyable":
arc> (def greet (name (o punc (case name who #\? #\!)))
And BTW, the :arg syntax for argument keys is only a CL convention.
In Arc, it would be much more natural/readable to use arg: (colon appended rather than prepended to the symbol).
[This is my comment re-posted from that thread. I hate zero-based indexing.]
In this expression:
[+ _1 _2 _3]
... the interpreter already knows that _2 is the second anonymous parameter to the function literal, without parsing the "2" token.
So you could instead have:
[+ _ _ _]
... and specify the index only if you actually want to use the parameter outside of the function literal (and there, _ would still be shorthand for _1).
And BTW, this whole thing should work with rest parameters, so that in:
Using [+ _ _ _] with the same meaning as [+ _1 _2 _3] would be confusing I'd say. [+ _ _ _] should mean the same as [+ _1 _1 _1] if it is allowed at all.
(= x2 [+ _ _])
Explicit numbering will make sure the arguments stay in the correct order so my earlier remark of allowing any underscore-beginning string is probably wrong.
> ... the interpreter already knows that _2 is the second anonymous parameter to the function literal, without parsing the "2" token.
> So you could instead have:
> [+ _ _ _]
> ... and specify the index only if you actually want to use the parameter outside of the function literal (and there, _ would still be shorthand for _1).
> And BTW, this whole thing should work with rest parameters, so that in:
I'm just saying that if you have several anonymous parameters in the parameter list of the function literal, it is just as well that you represent them with the same token _within the parameter list_. The parameter list is ordered, so the interpreter knows which is which anyway. If you then want to refer to one of the anonymous parameters _outside of the parameter list_, you use a token WITH an index, to identify the position of the particular anonymous parameter in the parameter list.
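The reading I'm proposing is mechanical: walk the bracket form, and give each bare `_` the next parameter index in order. A sketch in Python (not Arc), treating the form as nested lists, with hypothetical names:

```python
# Number the blanks in a bracket-fn body: each bare "_" becomes "_1",
# "_2", ... in left-to-right order, so ["+", "_", "_", "_"] reads as
# the body of (fn (_1 _2 _3) (+ _1 _2 _3)).
def number_blanks(form):
    counter = [0]
    def walk(e):
        if e == "_":
            counter[0] += 1
            return "_%d" % counter[0]
        if isinstance(e, list):
            return [walk(x) for x in e]
        return e
    return walk(form)
```

So `number_blanks(["+", "_", "_", "_"])` gives `["+", "_1", "_2", "_3"]`; explicit indices would only be needed when you want to refer to a parameter out of order, or more than once.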