Besides access to all the CL libraries, I think it would be worth doing for speed alone. Not everyone has (yet) used Arc in a situation where the speed became an issue, but it does happen pretty easily (just ask almkglor).
In SBCL, I am consistently able to achieve results within 30% of C programs. Without type annotations the speed is still quite respectable.
If we ever want Arc to be usable in the wild for a diverse range of applications, we will need a compiler that can generate fast code. This is one way of getting there.
Well, there are still ways to break the speed barrier. For instance the main reason I made cached-table was in order to cache the results of parsing the wikitext in Arki (but not keeping them for too long, since that would simply be a waste of memory). This is even arguably the "correct" thing to do and I might not have done it yet if Arc hadn't been so slow ^^.
"When I was a noob, I talked like a noob, I thought like a noob, I reasoned like a noob. When I became a hacker, I put noobish things behind me. Now we see but a poor implementation as in an alpha; then we shall see code to code. Now I code in part; then I shall code in full, even as I am full of code. Now these three remain: hubris, impatience, laziness. But the greatest of these is laziness."
Yes, you can always optimize your Arc code to mitigate the problem. My point still stands though. If we want to be able to use Arc in production situations, we will eventually crave an industrial-strength Arc compiler.
This becomes even more obvious if you take seriously the notion of "A Hundred-Year Language".
We ought to focus on what gap Arc is really meant to fill: quick prototyping, and easy deployment of applications. It is not designed to crunch numbers at the speed of light, or cure cancer. Web applications simply don't need the kind of speed that C et al. provide. If they did, we wouldn't be using Python or Ruby either.
We should focus on improving what Arc does, rather than trying to morph it into the behemoth Swiss Army knife that is Common Lisp.
Not to discourage you from your own pursuits regarding this, but I think as a whole, the Arc community would be better off putting its focus elsewhere.
I can't speak for the whole of "the Arc community" any more than you can, but...
I'm not trying to turn Arc into Common Lisp, I'm advocating making Arc fast. There's a difference.
CL was a behemoth because it was a union by committee of all the cruft in every popular Lisp dialect circa 1980. One of Arc's goals is to design from a clean slate. There is no pressure to conform to bad or inconsistent decisions of the past. None of that has anything to do with performance.
The question becomes: Do we want to make a real general purpose language, or do we want Arc to become "yet another scripting language"?
This is a long term question. We may not need a compiler tomorrow, but I'm not ready to say for the long haul: "Arc is for exploratory programming and rapid prototyping. It will never scale. You shouldn't write ray tracers in it, or image processing, or games. It's fine for web apps -- unless they get a lot of traffic."
The funny thing is that, even in some of my webapps, I always needed some more speed at one moment or another. The last example I have in mind: I wanted to generate a chart showing the use of a big system. As the chart's look depends on a few criteria given by the user, it has to be generated on demand. Well, I had to dig into a DB containing millions of items, then mix those items together (that couldn't be done in SQL), and finally generate the chart.
Glad I had Psyco there. If I hadn't had it, or at least Pyrex, I would have probably dropped Python because writing C extensions for it is quite painful. And that's also the reason why I never really used Ruby, despite its cool features.
And I don't want to write prototypes and then say: "Hmm... My code is working now, let's rewrite it in a serious language for the production version".
Look at Scheme anyway. We really can't say it's a language focused on speed or designed to crunch numbers. Well, look at Ikarus. Oh, yes, for number crunching, you might prefer Stalin. Even C looks slow when compared to Stalin.
Of course, this shouldn't be the main focus of the community, and I don't even think the language should be designed with speed in mind (well, a little, of course, or else we would have dead slow numbers implemented as lists of symbols), but it should be given serious consideration.
Yes, exactly. I think a major cause of all this railing against optimizations is all the newbies who have just learned to write programs running around shouting about efficiency. I was one of those just a couple of years ago. The problem with these newbies is that they're naive. They decide that something is fast or slow based on how efficient its most naive implementation sounds. It seems that they grasp the vague idea that optimized code tends to be longer than code that is not optimized, but rather than responding to that by not trying to optimize until they know what parts of the code are slow, they respond by assuming that approaches that take more lines of code are more efficient and therefore better.
A misplaced focus on speed is bad, and you should get it working before you make it fast, but that doesn't mean speed is a non-issue. If a program is slow enough to cause annoyance, that is a problem, and it should be fixed. Languages have to pay extra attention to speed issues. If programs written in a language are slow because of the language, and not because the programs themselves are badly written, there's something wrong with the language.
Another thing that newbies don't get is that well-built languages are usually optimized so that the more obvious and more commonly-used approaches are actually faster than that tangled mass of "optimized" code you just wrote. Profiling profiling profiling. Don't just guess.
So, the points are: performance should be a secondary concern, but secondary is still pretty high up on the list, and optimization should be based on information gathered with a profiler, not what sounds efficient or inefficient. Sorry for rambling, I hope this post contributes something to something. I just have a tendency to spew everything I have to say about a topic all in one place every now and then, even if only some of it is relevant. I guess you could boil this post down to a "me too", but only if you boiled it a lot.
Yes, but if we want quick prototyping, we also want transformation of the prototype into something we'll use, which might need to be faster than our mockup. Take for example Arki: the wikitext parser was written easily as a quick prototype on raymyers' treeparse, but in actual use, the wikitext parser was too slow. Unfortunately, there's no easy way to transform the parser structure into faster techniques (such as state machines) without doing it by hand, which rather undermines the purpose of the quick prototype. By optimizing the underlying parser, however, the prototype was still usable, and by exposing a few more particular combinations (such as nil-returning sequences, for when we only care that a syntax exists or doesn't exist, not the individual characters of the syntax), we can refine the prototype into something faster.
I tried that. Not very easy, as Ikarus' hash-tables don't work like mzscheme's (there are no "equal" hash-tables). You would have to rewrite the reader too. And I'm not sure we would get all the stuff with sockets and networking. It doesn't have an FFI yet, either.
Well, actually, I'm not sure implementing a language that is still being designed on the beta release of a compiler is a long-term solution... :) Maybe in a few months or years this could be done? But for now, I think porting it to CL would be easier.
In the long run, the best (and most difficult) solution would be to write an Arc compiler that translates Arc code directly into machine code.
I also suggest that. In fact, looking at Chicken's implementation - stack == heap - is rather inspiring, because it shows exactly how a garbage-collected memory manager should be done: just decrement a pointer, in this case the stack pointer. Brilliant IMO. Wish I'd thought of that.
As I have remarked before, treeparse was influenced by Haskell's parsec. One of the features of parsec is the optional infusion of more meaningful error messages into grammars. One could easily imagine adding this functionality to treeparse.
Interesting. I suppose this means that (at least for the cps version of treeparse) a fail must also "return" an error value (potentially just an "Expected lit %c, not found"). Of course 'alt parsers would ignore the error message, and generate their own when all alternatives fail ("No alternatives found"). Hmm. I think if the fail continuation returned an error message, we could create a filt-style parser (for the cps case):
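Something along these lines, say (a rough sketch only; it assumes cps parsers take (remaining pass fail), pass takes (parsed remaining actions), and the fail continuation now receives the hypothetical error message):

(def filt-err (fun parser)
  "Hypothetical filt analogue for error messages: wrap parser and
   pass its failure message through fun before failing upward."
  (fn (remaining pass fail)
    (parser remaining
            pass
            (fn (msg) (fail (fun msg))))))

; e.g. (filt-err [+ "in bolded text: " _] some-parser),
; where some-parser is whatever cps parser you want to annotate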
For the monadic version, our old nil-as-fail code would have to be changed, I suspect; basically, instead of the old type Return = Return parsed remaining actions | nil, we'll need type Return = Return parsed remaining actions | Failed message.
I don't think it's 'delay-parser, per se, just the interrecursion: 'formatting calls 'bolded-text calls 'formatting. I think this is what's making the profiler give weird numbers.
I am having trouble replicating your problem with mutual recursion. I get the expected result with this mutually recursive test. Can you give me an example that reports incorrectly?
(def foo (li)
  (if li (do (sleep 0.25) (bar (cdr li)) (sleep 0.25))
      (sleep 2)))

(def bar (li)
  (if li (do (sleep 0.25) (foo (cdr li)) (sleep 0.25))
      (sleep 1)))

(profile foo bar)
(foo '(a a a a a a a))
(profiler-report)
Hmm. I'll check again later. In the meantime, what I did in the great-grandparent post was just this, on a fresh start:
1. turn on profiling in wiki-arc/wiki-arc.arc, by setting the variable whose name escapes me at the moment to t
2. Load wiki-arc/wiki-arc.arc and (thread (asv)) the server.
3. Load a single test page, with various formatting, including '''a bolded text's attempt to ''confuse the parser''''', [[link]]s, [[sample link|links with different text]]s etc. etc.
4. (profiler-report)
I'll check later on it, and possibly also post the sample wikitext.
Not sure; it might actually be faster, since Scheme is supposedly much more continuation-friendly than lazy-evaluation-friendly. Or not; I'm not sure.
Anyway, I haven't managed to wrap my head around how to implement 'seq in a continuation style; also, there's the problem of how 'cut on a parser would work in a non-lazy language - supposedly, as soon as the parser given to 'cut consumes anything, 'cut will somehow dispose of the 'fail continuation. I'd assume 'cut will be nearer to cut-seq, which would approximately be a cut:seq -
(def cut-seq (parser . rest)
  (let next (apply seq rest)
    (fn (remaining pass fail)
      (parser remaining
              ; since the name "remaining" is reused here,
              ; a smart compiler should be able to
              ; dispose of the outer function's remaining
              (fn (parsed remaining actions)
                (next remaining
                      (fn (parsed2 remaining actions2)
                        (pass (+ parsed parsed2)
                              remaining
                              (+ actions actions2)))
                      [err "parse error!"]))
              fail))))
Since fail is disposed of upon continuing to next (which is the rest of the sequence), everything that fail holds could then be reclaimed. 'alt's pass is its own pass, but its 'fail contains a closure.
If we assume that the base compiler isn't very optimizing (i.e. captures closures when it doesn't need to) we could use helper functions for new 'pass functions:
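Just to illustrate what I mean (a sketch only, nothing from the actual library): pull the new 'pass closure out into a named helper, so the only things it captures are the ones passed in explicitly:

(def seq-pass (next pass fail2)
  "Build the pass continuation that chains into the next parser;
   it captures only next, pass and fail2, nothing else."
  (fn (parsed remaining actions)
    (next remaining
          (fn (parsed2 remaining actions2)
            (pass (+ parsed parsed2) remaining (+ actions actions2)))
          fail2)))

; cut-seq's body could then read:
; (parser remaining (seq-pass next pass [err "parse error!"]) fail)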
I've just built a continuation-approach parser combinator library and pushed it to the git repository. It's currently faster on 'many and 'seq, but slower on 'alt. It's considerably faster on Arki's use case, however (about 33% less time).
That said, I think the main issues slowing down the monadic treeparse are:
1. Use of 'return. I'm not sure if mzscheme can optimize this away (it's just being used as a multiple-value return, and is immediately destructured by the calling function). Heck, I'm not even sure Arc destructuring is in a form optimizable by Scheme. Using continuations, this tuple becomes the parameters of the call to the continuation, and I'm fairly convinced that Scheme (all Schemes) can optimize that. Also, by avoiding explicit destructuring, we can be reasonably sure that Arc gives the parameter list plain to Scheme (not that I've actually studied ac.scm much).
2. Copying of 'parsed. Unfortunately, we have to copy 'parsed, because of 'actions! This is because 'sem attaches the 'parsed part to the 'actions field (via closure), and so we can't reuse this 'parsed part (alternatively we could just copy 'parsed inside 'sem, instead of copying on 'many and 'seq). IMO: drop 'actions ^^. Copying 'parsed, I suspect, may be the single most time-consuming part of the monadic treeparse.
3. Using a 'tconc cell, while neater, is less efficient (about 20% IIRC) than explicit and separate head+tail variables. My goal in optimizing treeparse was to maintain legibility while adding speed; my goal in cps_treeparse was to optimize, legibility be damned^H^H^H^H^H^H^H consigned to hell.
4. There are a few more bits and pieces that may be optimizable - for example, most of my 'filt usage is (filt [map something _] parser). If we are sure that 'parsed will not need to be reused (assured by dropping 'actions, or giving the responsibility of copying 'parsed to 'sem), then we can define a (filt-map something parser) which returns the same list as 'parser, but with the cars transformed by 'something:
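A sketch of what that could look like in the monadic style (assuming the usual Return layout of (parsed remaining actions), and that we're allowed to mutate 'parsed in place because nobody else will reuse it):

(def filt-map (fun parser)
  "Like (filt [map fun _] parser), but destructively replaces each
   car of the parsed list instead of consing a fresh list."
  (fn (remaining)
    (iflet (parsed remaining actions) (parse parser remaining)
      (do ((afn (p)
             (when (acons p)
               (= (car p) (fun (car p)))
               (self (cdr p))))
           parsed)
          (return parsed remaining actions)))))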
Using filt-map would reduce consing, and especially in the cps case would remove the scan through 'parsed to find the 'parsedtl.
Further optimizations abound: another common idiom I found in my wiki parser was (filt nilfn (seq-str "something")), which basically ignores the 'parsed return value of 'seq. It may be possible to create a variant of 'seq, 'seq-nil, which wouldn't bother catenating the 'parsed return values of its subparsers.
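For example, a hypothetical seq-nil in the monadic style might look something like this (again assuming the Return layout of (parsed remaining actions)):

(def seq-nil parsers
  "Like seq, but discards the parsed output of its subparsers and
   always returns nil as the parsed value."
  (fn (remaining)
    ((afn (ps rem actions)
       (if (no ps)
           (return nil rem actions)
           (iflet (parsed rem2 actions2) (parse (car ps) rem)
             (self (cdr ps) rem2 (join actions actions2)))))
     parsers remaining nil)))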
So no, I don't think continuation-passing per se causes most of the speed difference, except possibly for eliminating the 'return structure.
Another possible issue is the use of a list-like structure. 'scanner-string also conses on each 'cdr, and the total memory usage actually exceeds (coerce "string" 'cons) - it's just that scanners are sexier ^^.
For this case it may be possible to create a variant of the CPS style which takes a tuple of (string, index, end) - as explicit and separate variables of course, so that underlying scheme has a shot at optimizing it away - and use indexing, which I suspect is faster on strings (and which scanner-string ends up doing anyway). This would remove the need for consing too.
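To make the idea concrete, here is a rough sketch of a single-character parser under that convention (purely hypothetical names; pass here receives the parsed value, the next index, and the actions):

(def lit-at (c)
  "Match the single character c at index i of string s,
   without consing a scanner."
  (fn (s i end pass fail)
    (if (and (< i end) (is (s i) c))
        (pass (list c) (+ i 1) nil)
        (fail))))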
Great to see someone using treeparse. Using scanner-string to view strings as lists is an approach I had never considered. As nex3 said, scanners are a surprisingly useful abstraction, showing yet more uses for lazy evaluation.
That said, the formatter in wiki-arc is plenty darned slow on & codes (I used a really long 'alt form). Lacking a profiler, I can't really tell whether it's treeparse or scanner-string which is slow. Any suggestions on optimization? I think one problem is that each parser returns a list of stuff, without any chance to reuse that list (especially for 'seq forms). Another problem might be that scanner-string is just too slow for easy usage.
I haven't fully digested the wiki parser yet. As a first thought, the use of enclose-sem confuses me a bit -- seems like a reinvention of filt. I doubt that would be a performance issue of course.
Maybe the grammar should be factored, or maybe this is an opportunity to optimize treeparse.
Try this: convert the string to a normal list of characters in advance, and see if the performance improves:
(map idfn (scanner-string "hello world"))
If that doesn't help, then scanner-string is not the problem.
Hmm, sorry, I didn't fully digest 'filt either. I thought 'filt was for transforming the 'parsed field in the return value? 'enclose-sem is intended to act on the 'actions field in the return value (although I did end up passing the entire return value).
The main slowdown occurred on adding & codes. Here's my most degenerate case:
Takes about 5-6 seconds to render; also, if I do something on the REPL (such as searching using (help "searchstring"), or loading some random library), the parsing takes up to 12 seconds. Anyway, I've added a rendering time at the lower right of each rendered page. Note also that I've added caching; to disable caching you'll need to search through wiki-arc.arc for *wiki-def and change the (cached-table) call to (cached-table 'cachetime 0).
(def many (parser)
  "Parser is repeated zero or more times."
  (fn (remaining) (many-r parser remaining nil nil nil nil)))

(let lastcdr (afn (p) (aif (cdr p) (self it) p))
  (def many-r (parser li acc act-acc acctl act-acctl)
    (iflet (parsed remaining actions) (parse parser li)
      (do
        (when parsed
          ; edit: necessary, it seems that some of the other
          ; parsers reuse the return value
          (zap copy parsed)
          ; end of edit
          (if acc
              (= (cdr acctl) parsed)
              (= acc parsed))
          (= acctl (lastcdr parsed)))
        (when actions
          ; edit: necessary, it seems that some of the other
          ; parsers reuse the return value
          (zap copy actions)
          ; end of edit
          (if act-acc
              (= (cdr act-acctl) actions)
              (= act-acc actions))
          (= act-acctl (lastcdr actions)))
        (many-r parser remaining
                acc act-acc acctl act-acctl))
      (return acc li act-acc))))
Basically instead of using join, I used a head+tail form of concatenating lists. It seems to work, and the optimization above seems to drop the test:
((many anything) (range 1 1000))
down to 27 msec (edited: 58 msec) on my machine (it was about 7350 msec on the older version).
What are your thoughts? The code now looks unprintable. Also, I'm not 100% sure of its correctness.
UPDATE: yes, it wasn't correct; however, the edited version above seems to work now. Rendering of my "difficult" page has dropped to 1100 msec.
I've since added a 'tconc facility to Anarki. Basically tconc encapsulates away the head+tail form of list catenation; a single cons cell is used with car==head and cdr==tail.
The head of the list is the start of the list, while the tail of the list is the last cons cell:
a cons cell:

    O->cdr
    |
    v
   car

the list (1 2 3 4 5):

    O->O->O->O->O->nil
    |  |  |  |  |
    v  v  v  v  v
    1  2  3  4  5

the tconc cell for the above list:

    tconc cell
    O-----------+
    | head      | tail
    v           v
    O->O->O->O->O->nil
    |  |  |  |  |
    v  v  v  v  v
    1  2  3  4  5
'tconc creates a new cell and modifies the tconc cell to repoint the tail to the new tail. You can extract the currently concatenated list by using 'car on the tconc cell.
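Roughly, the idea is this (a sketch of the concept only; the actual lib/tconc.arc on Anarki may differ in details):

(def tconc-new ()
  "An empty tconc cell: car is the head of the list, cdr its tail."
  (cons nil nil))

(def tconc (cell val)
  "Append the single element val in O(1) by moving the tail pointer."
  (let new (cons val nil)
    (if (car cell)
        (= (cdr (cdr cell)) new)  ; splice after the current tail
        (= (car cell) new))       ; first element becomes the head
    (= (cdr cell) new)            ; repoint the tail
    cell))

; (car cell) recovers the accumulated list at any point:
; (let c (tconc-new) (tconc c 1) (tconc c 2) (car c))  =>  (1 2)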
The diff between my version of treeparse and yours is now:
--- treeparse.arc 2008-03-21 11:59:13.000000000 +0800
+++ m_treeparse.arc 2008-03-22 23:00:51.000000000 +0800
@@ -23,4 +23,6 @@
; Examples in "lib/treeparse-examples.arc"
+(require "lib/tconc.arc")
+
(mac delay-parser (p)
"Delay evaluation of a parser, in case it is not yet defined."
@@ -112,12 +114,12 @@
(def many (parser)
"Parser is repeated zero or more times."
- (fn (remaining) (many-r parser remaining nil nil)))
+ (fn (remaining) (many-r parser remaining (tconc-new) (tconc-new))))
(def many-r (parser li acc act-acc)
(iflet (parsed remaining actions) (parse parser li)
(many-r parser remaining
- (join acc parsed)
- (join act-acc actions))
- (return acc li act-acc)))
+ (nconc acc (copy parsed))
+ (nconc act-acc (copy actions)))
+ (return (car acc) li (car act-acc))))
(def many1 (parser)
edit: note that the use of 'tconc/'nconc is slightly slower than explicitly passing around the tails. For the test, it runs at 79 msec on my machine (explicit passing ran at 58 msec); this is expected, since we must destructure the cons cell into the head and tail of the list under construction. Would it perhaps be better to use a macro to hide the dirty parts of the code in explicit passing of hd and tl?
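Something like this is what I have in mind (a sketch only, assuming hd and tl are plain variables in the caller and that a lastcdr like the one above is available):

(mac conc-onto (hd tl xs)
  "Destructively append the list xs onto the list tracked by the
   variables hd (head) and tl (tail)."
  (w/uniq gxs
    `(let ,gxs ,xs
       (when ,gxs
         (if ,hd
             (= (cdr ,tl) ,gxs)
             (= ,hd ,gxs))
         (= ,tl (lastcdr ,gxs))))))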
Nice optimization. I'm not so sure about the naming of nconc, though. Although it is used for a similar purpose as the traditional CL nconc, I would expect anything called nconc to behave like this:
(def last-list (li)
  (if (or (no li) (no (cdr li))) li
      (last-list (cdr li))))

(def nconc (li . others)
  "Same behavior as Common Lisp nconc."
  (if (no others) li
      (no li) (apply nconc others)
      (do (= (cdr (last-list li)) (apply nconc others))
          li)))
>> do you think this optimization is worth putting in treeparse?
Certainly. At the moment you are probably 50% of the treeparse user base, so it needs to be fast enough for your use case :)
I admit that efficiency wasn't a big thought when I first wrote treeparse (besides avoiding infinite loops -- hopefully those are gone now...). I fondly remember my CL optimization days... we've gotta make ourselves one of those nifty profilers for Arc.
It seems that 'many is the low-hanging fruit of optimization. I've since gotten an 8-paragraph lorem ipsum piece, totalling about 5k, which renders in 3-4 seconds (around 3800 msec).
Hmm. Profiler.
I'm not 100% sure, but maybe the fact that nearly all the composing parsers decompose the return value of their sub-parsers and then recompose a new return value is what's slowing it down? Maybe having parsers accept an optional return-value argument, which 'return would fill in (instead of creating its own), would significantly reduce memory consumption (assuming it's the GC that's slowing things down)?
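As a rough sketch of that suggestion (hypothetical only; the real treeparse 'return just builds a fresh list), 'return could take an optional cell to reuse:

(def return (parsed remaining actions (o into))
  "Build the (parsed remaining actions) tuple, reusing the list
   'into if the caller supplied one."
  (if into
      (do (= (car into) parsed)
          (= (car (cdr into)) remaining)
          (= (car (cddr into)) actions)
          into)
      (list parsed remaining actions)))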
Note that the timing will not be very accurate or particularly useful IMO, since it doesn't count recursion but does count calls to other functions. Sigh. We need a real profiler ^^
Hmm. It seems we can't squeeze much performance out of 'alt; I can't really see a way of optimizing 'alt itself, so possibly we should optimize the grammar that uses it.
True, filt doesn't touch the actions field, while sem does. However, I am usually able to replace the use of actions with filters that operate on the parsed field. I prefer this, because the filter style is a more clean and functional style -- rather than relying on side-effects. Hopefully that made sense.
I don't yet know for certain whether filters could or should be used in this particular case. enclose-sem might be the right way to go after all.
I'll have to defer to you on this one - I've only written a parser combinator type parser once before, and that was before I learned what it was. I did end up using something nearer to filters (i.e. acts on the returned value instead of having a separate 'actions field).
Edit: Perhaps part of the optimization of treeparse could be to eliminate the 'actions field. None of your samples used 'sem, and I might prefer 'filt for such enclosing stuff as ''bold'' and '''italics'''. The [[link]]s might be more difficult (because of the necessity of adding alphadigs after the closing bracket in [[link]]s to the actual link text but not the link target), but I think it's doable.
I've thought several times of removing sem in favor of filt. As features, they are very similar but filt is usually cleaner. If I don't see a compelling use case for actions soon I may indeed remove them. This would simplify the interface considerably.
Just between us, filt is actually an analogue to Haskell's monadic lift operator. They're even anagrams! (this happened by accident.)
Okay, I've since ported the Arki wikiformat parser to use filt instead of sem. Removing 'actions would reduce slightly the memory consumption of treeparse.
Doesn't help. I inserted (map idfn ...) on the value given to enformat, which is really the currying of enformat-base (the very short (fn ) at the end):
Hmm. I tried refactoring the (apply alt (map str-seq ...)) thing into an alt-str-seq: basically I created a trie, then did 'alt on each key, with each sub-trie becoming another 'alt, stored sequences converted to 'seq, and end-of-strings (nil keys) stored as 'nothing. But it was buggy and didn't improve performance T.T.
I've made quite a few attempts at improving performance, but I often just ended up with more bugs, waa!
Hmm. I just tried replacing the scanner. Just as you also found, it didn't help performance.
(def scanner-string2 (s (o start 0) (o end (len s)))
  (map idfn (scanner-string s start end)))
I'll see if I can find time soon to digest the wiki-arc grammar. Possibly it could be optimized, but I suspect treeparse could be handling it better. There are probably lessons to be learned from how Parsec gets good performance.
It seems that part of Parsec's optimization is that it is actually limited to an LL(1) grammar (whatever that means), and its <|> ('alt) will fail immediately if the first parser consumed anything at all. Not sure how that translates to treeparse.
LL(1) grammars only require one token of look-ahead to parse.
Parsec does not strictly require this, it can handle infinite look-ahead grammars. However, for good performance, it is best to use LL(1) grammars -- so there will be no backtracking required.
When using Parsec, I have often been surprised by the quick-failing behavior of <|> that you mentioned. Thus, I did not duplicate it in treeparse.
And to expand on that, the answer is generally yes. MzScheme is a great version of Scheme for its completeness, not its speed. However, several Schemes are directly competitive with SBCL and CMUCL. Arc could be ported to one of these without too much effort.
As for speed, Stalin Scheme is amazing, for example, though not very complete or well documented. However, CL is full of "efficiency hacks". Scheme is full of purity, which is not always easy to implement efficiently. Optimizing CL is easier than optimizing Scheme, IMHO.
Why do you think CL is easier to optimize than Scheme? This isn't a hostile question; I'm just curious. Intuitively, I feel like "purer" languages should be easier to optimize.
First, CL has in the standard many possibilities for making declarations regarding optimization: for the functions you want, you can compile code, declare types (e.g. this var only holds fixnums), declare that you want to optimize for speed and ignore type safety, etc. This way, you end up writing code the way you would write it in C. There is no such thing in the Scheme standard. Individual implementations could, of course, but as far as I know none of them does.
Another example is call/cc. This is a very interesting beast, existing only in Scheme. But it is hard to implement efficiently.
The last example I can think of is 'nil. Using nil as both false and the empty list is very interesting in this regard: you can implement nil as the NULL pointer, which is also 0, the false boolean. Fewer manipulations to do on the bare metal. Distinguishing between #f and '(), on the contrary, implies making more tests at the lower levels.
re: call/cc - I think a bit of the lambda the ultimate series of papers eventually boils down to the realization that a machine language jump-to-subroutine is equivalent to a call/cc, and the target of the call/cc just has to access the return address on the stack as a function address.
I don't get it. There's the whole stack copying for call/cc, so call/cc is much more expensive.
(I read the "Lambda the ultimate GOTO" paper you referenced earlier; it's about goto vs structured programming, not call/cc. As an aside, it's very interesting to reflect on just how controversial structured programming was.)
Implementing call/cc efficiently has been well-researched in the Scheme community. For a very well-written account of a non-stack-copying implementation see
Then continue at ReadScheme at "Compiler Technology/Implementation Techniques and Optimization"
to see further developments (Look especially for Clinger's papers).
In a function call, the state (the current computation being done) is saved anyway, and therefore "copied" if that is your preferred term. So ideally, call/cc should have the same overhead as an ordinary function call; the only difference is that in call/cc the continuation state is the value given to the function, while in a function call it's just one of the values given to the function.
Note, however, that much of the theoretical analysis of call/cc makes a basic assumption of a "spaghetti stack", which would mean that partially unwound stacks are saved implicitly as long as any continuation exists that refers to them, and all stacks themselves are subject to garbage collection. Most machines don't actually have a spaghetti stack and can't make one anyway ^^. That said, a spaghetti stack could be implemented as a simple list, with push == cons and pop == cdr.
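In list terms, the whole mechanism is just this (a trivial sketch of the idea, not an implementation):

; each "frame" is consed onto the current stack; a captured
; continuation is simply a reference into some older stack,
; which the GC keeps alive for as long as the reference exists
(def push-frame (frame stack) (cons frame stack))  ; push == cons
(def pop-frame  (stack)       (cdr stack))         ; pop  == cdr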
Alternatively store the local variables on a garbage-collected heap, and include a reference to the local variables with the continuation/return address (you'll probably need to save the pointer-to-local-variables anyway, since the target function is likely to use that pointer for its own locals). Again, no additional overhead over plain function calls, except perhaps to restructure the return address and the pointer-to-local-variables.
Don't know about MIT Scheme, but if I were to implement call/cc on stock hardware and compiling down to machine language that's what I'd do ^^
I'm still totally not understanding your claim that call/cc should have the same overhead as an ordinary function call.
I read the Clinger "Implementation Strategies for Continuations" paper and they found call/cc about 10 times slower than function calls on the tak/ctak tests. I tried those tests on PLT Scheme and the overhead I saw is even worse: .7 seconds with function calls vs 51.8 seconds with continuations on (tak 24 16 8).
Clinger says about the stack strategy: "When a continuation is captured, however, a copy of the entire stack is made and stored in the heap. ... Variations on the stack strategy are used by most implementations of Scheme and Smalltalk-80."
Components of function call: (1) put arguments somewhere (2) put return address somewhere (3) jump to function.
Components of call/cc: (1) put return address as argument somewhere (2) put return address somewhere (3) jump to function.
That said, continuations generally assume that "put X somewhere" means somewhere != stack. Yes, even return addresses are not intended to be on the stack when using continuations. The main problem is when compiling down to C, which always assumes that somewhere == stack. If you're compiling down to machine language where you have access to everything, then you can just ignore the stack, but not in C, which inherently expects the stack.
Ah, I get it. I suppose typing is one of the big issues affecting speed. If the language standard insists on dynamic typing, there might be no way to get certain kinds of optimizations.
And yeah, call/cc is probably always going to be a bear. But man is it cool. :)
I suppose this goes against what a few of us (including me) were saying above -- that the language and the speed are really separate issues. Or maybe it's more coherent to say that language standards (as opposed to "languages" generally understood) can have a profound effect on speed. If they don't give implementors a lot of choice, they can box people into certain corners.
It's interesting that Scheme actually mandates optimization in at least one case (tail recursion). I don't know how many language standards make those kinds of demands, but I suspect there aren't many.
Tail recursion is interesting because it is not so much an optimization for speed as a way to make programmers rely primarily on functional programming: if you don't have it, functional programming quickly becomes a dead end, as you can make the stack explode really fast. As a bonus, it is faster :)
To show more of what is possible, "lib/treeparse-examples.arc" shows how to define the 20-line hcase2 macro, a pattern matching case statement that extends defpat with guard clauses and some infix sugar.
(def union (< xs ys)
  "Merge two sorted lists, discarding duplicates."
  (hcase2 `(,xs ,ys)
    xs () = xs
    () ys = ys
    (x . xt) (y . yt)
      || x < y = x : (union < xt ys)
      || y < x = y : (union < xs yt)
      || t     = x : (union < xt yt)))