I strongly agree. After all, there are no particular accommodations for sounds, images, and what have you. Why would there be one for verbal data?
Strings have no place in the core logic of a program. It is only because of inertia that we have this remnant from the early terminal days.
And strings bring their own set of problems (not very useful for a 100-year language): translation, ASCII vs. the latest Unicode version, and coupling (the web is finally beginning to get this right by separating behaviour from content and presentation).
In fact, many languages (or at least their large libraries) do have accommodations for sounds, images, etc. This is because most languages have some sort of "class" mechanism, and define classes for those things (in the standard library or in a very commonly used one). And sometimes what you are working on is a string parsing, manipulating, or generating program, where strings do belong.
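A rough sketch of the "class" point, in Python. The `Image` and `Text` types here are made up for illustration and are not taken from any real library; the idea is only that text can be accommodated by the same mechanism as images or sounds, rather than by special treatment in the core language.

```python
from dataclasses import dataclass


@dataclass
class Image:
    """Hypothetical image type: raw 8-bit grayscale pixels, row-major."""
    width: int
    height: int
    pixels: bytes

    def pixel_at(self, x: int, y: int) -> int:
        # Row-major lookup: skip y full rows, then move x columns in.
        return self.pixels[y * self.width + x]


@dataclass
class Text:
    """Hypothetical text type: just another class wrapping its data."""
    chars: str

    def word_count(self) -> int:
        # Split on runs of whitespace and count the pieces.
        return len(self.chars.split())


img = Image(width=2, height=2, pixels=bytes([0, 64, 128, 255]))
txt = Text("strings are just another class")
```

Seen this way, the two kinds of data sit on equal footing: each is a library type with its own operations, and a program that really is about text simply uses `Text` heavily.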