Arc Forum

2 points by map 6550 days ago | link | parent

I'm for adding regular expressions and other string-handling functions. String manipulation is one of the most common tasks in my programs. Are mzscheme's regular expressions good enough?

2 points by map 6550 days ago | link

Can this tiny Ruby program that I recently wrote for my own use be easily converted to Arc? If not, what additions to the language would make it easy to convert this program? (Please keep the string containing the dates unchanged; don't pamper the language.)

  require 'date'

  "
  2008.2.18
  2008--3--21
    2008/5/26
  2008_7_4

  ".scan( /\S+/ ){|date_str|
    date_ary = date_str.split(/\D+/).map{|s| s.to_i}
    the_date = Date.new( *date_ary )
    diff = the_date - Date.today
    if (0..7).include?( diff )
      puts "***"
      puts "*****"
      puts "*******"
      puts "\a  #{ the_date } is a holiday!"
      puts "*******"
      puts "*****"
      puts "***"
    end
  }

-----

4 points by nex3 6550 days ago | link

This is largely a library issue - Arc could certainly do with better support for string ops, regexen, and especially datetime manipulation. But here's what I came up with (note that date-days is pretty inaccurate most of the time):

  (def compact (seq) (keep ~empty seq))

  (def date-days (date)
    (+ (* 365 (date 'year))
       (* 30  (date 'month))
       (date 'day)))

  (def date- (d1 d2)
    (- (date-days d1) (date-days d2)))

  (def format-date (date)
    (string (pad (date 'year) 4 #\0) "-"
            (pad (date 'month) 2 #\0) "-"
            (pad (date 'day) 2 #\0)))

  (= str "
  2008.2.18
  2008--3--21
    2008/5/26
  2008_7_4
  
  ")

  (each date (map (fn (date)
                    (map [coerce _ 'int]
                         (compact:ssplit date [no (<= #\0 _ #\9)])))
                  (compact:ssplit str))
    (let the-date (obj year (date 0) month (date 1) day (date 2))
      (when (<= 0 (date- the-date (datetbl (seconds))) 7)
        (prn "***")
        (prn "*****")
        (prn "*******")
        (prn "\a  " (format-date the-date) " is a holiday!")
        (prn "*******")
        (prn "*****")
        (prn "***"))))
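To make the "pretty inaccurate" caveat concrete, here is a rough Ruby sketch of the same approximation (every year counted as 365 days, every month as 30) next to an exact difference. The helper names approx_days, approx_diff and exact_diff are invented for this illustration; only Date comes from Ruby's standard library.

```ruby
require 'date'

# Hypothetical port of the date-days approximation:
# every year counts as 365 days and every month as 30.
def approx_days(date)
  365 * date.year + 30 * date.month + date.day
end

# Difference in days according to the approximation.
def approx_diff(d1, d2)
  approx_days(d1) - approx_days(d2)
end

# Calendar-accurate difference; Date subtraction returns a Rational.
def exact_diff(d1, d2)
  (d1 - d2).to_i
end

feb1  = Date.new(2008, 2, 1)
jan31 = Date.new(2008, 1, 31)

puts approx_diff(feb1, jan31)  # prints 0, although the dates are a day apart
puts exact_diff(feb1, jan31)   # prints 1
```

The error is largest around month boundaries, which matters for a check like "within the next 7 days".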

-----

4 points by map 6550 days ago | link

Good work.

I think you've highlighted at least one gap in Arc's arsenal. Using Arc2:

  arc> (ssplit " foo bar ")
  Error: "reference to undefined identifier: _ssplit"

You needed to use ssplit, but Arc doesn't have it.

I don't think the importance of string ops should be underestimated. String ops are just as essential as numerical ops. A language that cannot effortlessly manipulate strings is a low-level language in my book. If people are supposed to be testing Arc by using it instead of the languages they were using, Arc needs string ops. Can't they be easily lifted from mzscheme?

Remember the thread on implementing Eliza in Lisp and Ruby? No one posted an Arc version.

-----

3 points by nex3 6549 days ago | link

Oh, sorry, I should have specified: I used Anarki-specific stuff in several places. Mostly the date-manipulation, but also ssplit (I actually hadn't realized that wasn't in arc2. Yikes). Using Anarki, it should work, though.

I totally agree that string ops are important. If I recall, PG has also said something to this effect, so I wouldn't be surprised if more of them crop up in the next few releases.

-----

3 points by bogomipz 6550 days ago | link

So Ruby has a syntax for regular expressions, such as /\D+/. What I've always wondered is, does this have any advantage at all?

I mean, the actual regex operations are done by methods on the string class, which, as nex3 mentioned, is at the library level.

Is there any reason

  a_string.split(/\D+/)

is better than

  a_string.split("\D+")

Please do enlighten me.

-----

5 points by nex3 6550 days ago | link

A distinction between regexen and strings is actually very handy. I've done a fair bit of coding in Ruby, where this distinction is present, and a fair bit in Emacs Lisp, where it's not.

There are two places where it really matters. First, if regexen are strings, then you have to double-escape everything. /\.foo/ becomes "\\.foo". /"([^"]|\\"|\\\\)+"/ becomes "\"([^\"]|\\\\\"|\\\\\\\\)+\"". Which is preferable?
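A small Ruby sketch of the double-escaping point, comparing the pattern text of a regex literal with that of a regex built from a string. The helper same_source? is a name made up for this example; Regexp.new and Regexp#source are standard Ruby.

```ruby
# Hypothetical helper: does a regex built from a string have the same
# pattern text as the regex literal?
def same_source?(literal, string_form)
  literal.source == Regexp.new(string_form).source
end

# The string form needs every backslash doubled.
puts same_source?(/\.foo/, "\\.foo")   # prints true

# With a single backslash, "\." collapses to "." inside the string,
# so the resulting pattern matches any character before "foo".
puts same_source?(/\.foo/, "\.foo")    # prints false

# The quoted-string pattern from the post, in both spellings.
puts same_source?(/"([^"]|\\"|\\\\)+"/, "\"([^\"]|\\\\\"|\\\\\\\\)+\"")  # prints true
```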

Second, it's very often useful to treat strings as auto-escaped regexps. For instance,

  a_string.split("\D+")

is actually valid Ruby. It's equivalent to

  a_string.split("D+")

because "\D" isn't a recognized escape sequence in a double-quoted string, so the backslash is dropped and the string is split on the literal text "D+". For example

  "BAD++".split("D+") #=> ["BA", "+"]

Now, I'm not convinced that regexen are necessary for nearly as many string operations as they're typically used for. But I think no matter how powerful a standard string library a language has, they'll still be useful sometimes, and then it's a great boon to have literal syntax for them.
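As a sketch of the "strings as auto-escaped regexps" point above, the snippet below splits the same text once on a literal string and once on a regex. split_on is a throwaway wrapper named for this example; String#split is standard Ruby.

```ruby
# Throwaway wrapper around String#split, which accepts either a String
# (matched literally) or a Regexp (matched as a pattern).
def split_on(text, separator)
  text.split(separator)
end

p "\D+" == "D+"                    # => true, the backslash is dropped
p split_on("BADD++", "D+")         # => ["BAD", "+"]  literal "D+"
p split_on("BADD++", /D+/)         # => ["BA", "++"]  one or more D
p split_on("2008--3--21", /\D+/)   # => ["2008", "3", "21"]
```

Because the argument's type selects the behavior, the literal "+" never has to be escaped in the string case.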

-----

3 points by bogomipz 6550 days ago | link

OK, so what it comes down to is that you don't want escapes to be processed. Wouldn't providing a non-escapable string be far more general, then?

Since '\D+' clashes with quote, maybe /\D+/ is a good choice for the non-escapable string syntax. Only problem is that using it in other places might trigger some reactions as the slashes make everybody think of it as "regex syntax".

-----

3 points by nex3 6549 days ago | link

Escaping isn't the only thing. Duck typing is also a good reason to differentiate regular expressions and strings. foo.gsub("()", "nil") is distinct from foo.gsub(/()/, "nil"), and both are useful enough to make both usable. There are lots of similar issues - for instance, it would be very useful to make (/foo/ str) return some sort of match data, but that wouldn't be possible if regexps and strings were the same type.
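A short Ruby sketch of the two gsub calls mentioned above, plus the kind of match data a distinct regex type can return. replace_all and first_capture are names invented for this example; String#gsub and Regexp#match are standard Ruby.

```ruby
# Invented wrapper: replace every occurrence of pattern with "nil".
def replace_all(text, pattern)
  text.gsub(pattern, "nil")
end

# Invented wrapper: return the first capture group, or nil if no match.
def first_capture(regex, text)
  m = regex.match(text)
  m && m[1]
end

# A String pattern is matched literally: only the text "()" is replaced.
p replace_all("(car ())", "()")            # => "(car nil)"

# A Regexp pattern /()/ is an empty group, which matches at every position.
p replace_all("ab", /()/)                  # => "nilanilbnil"

# Matching against a Regexp yields match data with captures.
p first_capture(/(\d+)/, "year 2008")      # => "2008"
```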

-----

4 points by bogomipz 6549 days ago | link

Now we're getting somewhere :) For this argument to really convince me, though, Arc needs better support for user-defined types. It should be possible to write special cases of existing functions without touching the core definition. Some core functions use case forms or similar to treat data types differently. Extending those is not really supported. PG has said a couple of times:

"We believe Lisp should let you define new types that are treated just like the built-in types-- just as it lets you define new functions that are treated just like the built-in functions."

Using annotate and rep doesn't feel "just like built-in types" quite yet.

-----

2 points by almkglor 6549 days ago | link

Try 'redef on nex3's arc-wiki.git. You might also be interested in my settable-fn.arc and nex3's take on it (settable-fn2.arc).

-----

3 points by earthboundkid 6549 days ago | link

You could always do it the Python way: r"\D+" => '\\D+'

There's also u"" for Unicode strings (in Python <3.0) and b"" for byte strings (in Python >=2.6).
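For comparison, in Ruby (the language used earlier in the thread) single-quoted strings play a roughly similar role. This is only a sketch: single quotes still process \\ and \', so they are not fully raw. digits_only is a helper name made up for this example.

```ruby
# Made-up helper: build a Regexp from pattern text and split on it.
def digits_only(text, pattern_text)
  text.split(Regexp.new(pattern_text))
end

# In single quotes the backslash before D is kept.
p '\D+' == "\\D+"                     # => true
p '\D+'.length                        # => 3

# So the pattern text can be passed to Regexp.new without doubling.
p digits_only("2008--3--21", '\D+')   # => ["2008", "3", "21"]
```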

-----

2 points by map 6550 days ago | link

If the "x" modifier is used, whitespace and comments in the regex are ignored.

  re =
  %r{
      # year
      (\d {4})
      # separator is one or more non-digits
      \D+
      # month
      (\d\d)
      # separator is one or more non-digits
      \D+
      # day
      (\d\d)
  }x

  p "the 1st date, 1984-08-08, was ignored".match(re).captures

  --->["1984", "08", "08"]

-----