Automagically-translating chat thingy

blog entry posted by lalo (Lalo Martins) on 2008-12-19 20:23:00


Usually, I have to communicate with the people in the building's management office via Google Translate. It works, but it's awfully painful to be constantly flipping the language drop-downs back and forth. (It's two drop-downs, one for source and one for target language.)

So I wrote a little JavaScript gadget that does the hard work for me, and also keeps a “log” of the conversation. You can peruse it at http://lalomartins.info/transchat.html

(Attention though: this is not a chat app, not in the modern sense. It's “chat” in the old-school sense, of actually talking to a person that's in front of you. It's... an interpreter widget, not a chatbox :-) enjoy and spread if you wish...)

XML considered harmful, or,

blog entry posted by lalo (Lalo Martins) on 2008-10-25 15:37:00


I have, on a number of occasions, stated that XML is harmful, and should be taken out and shot. So here I am today, to explain why I think that, and offer alternatives.

Not good for humans

The main problem is, of course, that XML was never intended for humans. It's not designed so that we can efficiently write it, read it, understand it at a glance, or maintain it. But many tools that use XML today tend to forget that, leading to hours of wasted time and lots of frustration. (XML for configuration files, anyone? Zope's ZCML and .NET's configs and all those Java frameworks?)

Then, of course, that's not XML's fault; it was never designed to succeed at that task. The fault lies with developers who misuse it. Well, yes and no. The reason people misuse it is that it's overhyped; XML is the new peanut butter (or garlic butter, according to Pete Abrams): adding it to anything makes it taste better and sell more. (I don't even like peanut butter.)

Not good for machines

What it was designed for is communication between programs: a unified, extensible format for data transmission. By having libraries to handle it in most languages and environments, you'd make it easy for developers to deal with it, and as a consequence, to make their programs communicate.

However, after roughly ten years of working with it, it is my informed opinion that XML fails at that, too. I'm not saying it got supplanted by better technology which we invented later. It did, to be fair. But what I'm saying is that it was wrong from the beginning. And if it's not good for us and it's not good for our programs, why are we still using it? (Peanut butter, I know.)

So let's try to break out of the hype and prove that it's bad for our programs.

The perceived problem with XML can be summarised in one sentence: XML is costly to parse. But that's too superficial; let's go deeper, look at the specifics, and the flaws in philosophy/design that lead to this perception.

Parsing XML: layers

I usually tell my co-workers that there are two “layers” to parsing XML. While that is true, it's only true in the context of our data; if I were to make that statement more generic, I'd say: there are always at least two “layers” to parsing XML.

The first, the “bottom” layer if you want, is syntactic parsing. This means reading XML itself: tags, entities, attributes, comments, CDATA, PCDATA, white space, the works. The input to syntactic parsing is a string or stream of bytes; the “output” is an API — SAX, DOM, ElementTree, you name it.

On the opposite end of the stack, the “top” layer so to speak, is semantic parsing, or extracting the data you're actually interested in. The “input” here is a generic API; in the typical case of two layers, the API from syntactic parsing. The “output” is a domain-specific API or, more commonly, a collection of structured data (usually objects, nowadays).
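
To make the two layers concrete, here is a minimal sketch in Python using the standard library's ElementTree. The document, the Person class and the function names are all invented for this illustration; it is not the code from work, just one plausible shape for the split.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical input, made up for this example.
XML_TEXT = """<people>
  <person id="1"><name>Ada</name><age>36</age></person>
  <person id="2"><name>Linus</name><age>28</age></person>
</people>"""


@dataclass
class Person:
    id: int
    name: str
    age: int


def parse_syntactic(text: str) -> ET.Element:
    """Bottom layer: a string goes in, a generic tree API comes out."""
    return ET.fromstring(text)


def parse_semantic(root: ET.Element) -> list:
    """Top layer: the generic tree goes in, domain objects come out."""
    return [
        Person(
            id=int(node.get("id")),
            name=node.findtext("name"),
            age=int(node.findtext("age")),
        )
        for node in root.findall("person")
    ]


people = parse_semantic(parse_syntactic(XML_TEXT))
```

Note that every type conversion (the int() calls) lives in the semantic layer; the syntactic layer only ever hands back strings.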

An example where you may have more than two layers is when you're using something else built on top of XML; the most common case being feeds. So at the bottom layer something will parse XML, then another chunk of code will parse that as RSS or Atom, and then your semantic layer will actually extract the data. At work, we initially made our data available as RDF; so we had a second, “middle” layer (we actually used a JavaScript RDF library) which would parse the RDF, and then we did our semantic parsing by using the RDF library's API. That made our code a lot simpler, but it also made it a lot slower; so we later switched to ignoring the RDF and simply treating it as XML. (Even later, we switched to a JSON format.)
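
A rough sketch of the three-layer case, using a tiny Atom feed rather than our RDF data. The feed content and the function names are assumptions made up for the example; only the Atom namespace URI and the ElementTree calls are real.

```python
import xml.etree.ElementTree as ET

# Atom namespace in the "{uri}tag" form that ElementTree expects.
ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed, invented for this example.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry><title>First post</title><updated>2008-10-25T15:37:00Z</updated></entry>
  <entry><title>Second post</title><updated>2008-12-19T20:23:00Z</updated></entry>
</feed>"""


def parse_xml(text):
    """Bottom layer: text to a generic element tree."""
    return ET.fromstring(text)


def parse_atom(root):
    """Middle layer: element tree to feed entries, knowing only about Atom."""
    return [
        {
            "title": entry.findtext(ATOM + "title"),
            "updated": entry.findtext(ATOM + "updated"),
        }
        for entry in root.findall(ATOM + "entry")
    ]


def newest_titles(entries):
    """Top layer: the data this particular program cares about."""
    # The timestamps here share one format and zone, so string order is date order.
    ordered = sorted(entries, key=lambda e: e["updated"], reverse=True)
    return [e["title"] for e in ordered]


titles = newest_titles(parse_atom(parse_xml(FEED)))
```

Each extra layer makes the one above it simpler, and each one costs another pass over the data, which is the trade-off described above.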

Syntactic parsing: too much structure

Syntactic parsing is what XML is supposedly “all about”; the point being, you don't see it. In our case, at work, it's done by the browser (which gives us DOM with a touch of XPath). In pretty much any other case, it will still be done by your environment (JBoss and .NET are other examples), or by a standard library.

Well, that's great, right?

It is, yeah. But it hides the fact that those libraries (even if it's “hidden” in the environment, it's still at some level done by a library) tend to be huge and ridiculously complex. The XML syntax is designed to cover an enormous universe of cases that your program will concretely never encounter, and yet, you have to pay the complexity cost for them.

Semantic parsing: not enough structure

XML shines on XHTML: a markup language for text, where you have arbitrary streams of text sprinkled with special instructions about it. Some of those “instructions” are really containers, which have more text and instructions. XML does that really well.

It shines a little less on something like SVG, where it represents arbitrary streams of heterogeneous objects. Some of those contain other objects, and XML does help there.

But the truth is that, for representing your program's data? It probably sucks. Its model is very different from the object model of most (all?) popular languages and frameworks today. In the end, we find ourselves designing our data structures as many as three times: once in the language in which we're actually writing the program, once in a relational database, and once in XML. The mappings between them are often poor, since the semantics of the three models are so poorly matched.

Sadly, it would be relatively trivial to pick a lowest-common-denominator model that would fit all of today's popular languages. But XML didn't even try.

That's not the whole of my objection, though. Due to the MASSIVE FAIL in the syntactic layer, we get a semantic layer that's only marginally simpler than it would be to parse a DSL (domain-specific language); maybe less simple, if you use a good library for your DSL. There are about half a dozen XML APIs in wide use; smart people are frequently getting annoyed at the ones already there and coming up with a new, better one. And although a modern offering like, say, ElementTree can be light-years ahead of SAX or DOM, it can't help being clumsy and feeling unnatural to the language; at the bottom line, what it's doing is dressing up a rotting corpse.

Conclusion

Here's a better phrasing then, for the problem of XML as I see it:

XML has too much structure where it doesn't help, and not enough where it matters. One of the reasons I love JSON is that it's not designed to mark-up text, or to transfer “streams of data”; it's designed to transfer objects (JSON means “JavaScript Object Notation”), which means it maps nicely to my code on both ends, whether that code is JavaScript, Python, C++, or even C. (It maps nicely to Java as well, but who cares.)
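
As an illustration of that mapping on the Python end, here is a minimal sketch with the standard json module. The Person class and the sample data are made up for the example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Person:
    name: str
    age: int


# Hypothetical payload, invented for this example.
text = '[{"name": "Ada", "age": 36}, {"name": "Linus", "age": 28}]'

# JSON objects arrive as dicts with typed values, so they can be passed
# straight to a constructor without any string-to-number conversion.
people = [Person(**item) for item in json.loads(text)]

# Going back the other way is just as direct.
round_trip = json.dumps([asdict(person) for person in people])
```

There is no separate semantic pass to speak of: numbers are already numbers, lists are already lists, and the keys line up with the attribute names.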

Alternatives (existing and ideal)

Right now, for real-life code, most places where you're using (or thinking of using) XML would probably be better served with JSON. A few more complex cases may justify a DSL, but I would hesitate a lot before going down that route.

Ideally, I'd like to propose a new format; an “active” derivative of JSON, inspired by the modern practise of “JSON with callback”. Essentially, I'd like to replace JSON's “flat” object notation ({"attr1": "value", "attr2": "value"}) with something which looks like a Python constructor (MyClass(attr1='value', attr2='value')). The pseudo-classes (or pseudo-functions if you're looking at it from C) would play the role that tag names play in XML elements, which would make it even more straightforward to map this data to actual objects on each end.
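
The format doesn't exist yet, so the following is only a sketch of how a reader for it might look in Python, borrowing the standard ast module because the notation happens to be valid Python syntax. The REGISTRY table, the build and loads functions and the dataclass are all hypothetical names; nothing is executed, the text is only parsed and walked.

```python
import ast
from dataclasses import dataclass


@dataclass
class MyClass:
    attr1: object
    attr2: object


# Hypothetical table mapping pseudo-class names to real classes.
REGISTRY = {"MyClass": MyClass}


def build(node):
    """Turn a parsed expression into objects without evaluating any code."""
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        if node.args or any(kw.arg is None for kw in node.keywords):
            raise ValueError("only keyword arguments are supported")
        cls = REGISTRY[node.func.id]  # KeyError for unknown pseudo-classes
        return cls(**{kw.arg: build(kw.value) for kw in node.keywords})
    # Anything else must be a plain literal (string, number, list, dict...).
    return ast.literal_eval(node)


def loads(text):
    return build(ast.parse(text, mode="eval").body)


obj = loads("MyClass(attr1='value', attr2='value')")
nested = loads("MyClass(attr1=MyClass(attr1=1, attr2=2), attr2='x')")
```

One limitation of this sketch: a pseudo-class call nested inside a list or dict literal is rejected, since ast.literal_eval only accepts plain literals. A real implementation would want its own small parser rather than leaning on Python's.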

This would, of course, lose the benefit that “JSON with callback” can simply be executed in a browser. But then again, “JSON with callback” is not formally correct JSON anyway, so we already sacrificed some portability for that ability. “Real” JSON is usually converted to “JSON with callback” by a simple routine on the server side. A similar transformation could convert the format I'm proposing into JavaScript; the fragment above would become: MyClass({attr1: 'value', attr2: 'value'}).
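
A sketch of what such a server-side transformation might look like, again in Python and again leaning on the ast module because the proposed notation parses as a Python expression. The to_js and convert names are made up; string values come out double-quoted (via json.dumps) instead of single-quoted, which JavaScript treats the same way.

```python
import ast
import json


def to_js(node):
    """Render a parsed constructor-style expression as JavaScript source."""
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        if node.args or any(kw.arg is None for kw in node.keywords):
            raise ValueError("only keyword arguments are supported")
        fields = ", ".join(
            f"{kw.arg}: {to_js(kw.value)}" for kw in node.keywords
        )
        # Doubled braces in an f-string produce literal { and }.
        return f"{node.func.id}({{{fields}}})"
    # Plain literals are emitted as JSON, which is also valid JavaScript.
    return json.dumps(ast.literal_eval(node))


def convert(text):
    return to_js(ast.parse(text, mode="eval").body)


js = convert("MyClass(attr1='value', attr2='value')")
```

The browser would then only need a MyClass function in scope, taking a single object argument, much like the callback in “JSON with callback”.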

Mass unblocking in the Great Firewall of China

blog entry posted by lalo (Lalo Martins) on 2008-08-02 22:13:00


Seems a batch of sites got unblocked. Wiki.edia (marvel as I blog in regular expressions) is accessible (again), Wikibooks, Reuters, CNN, and a lot more.

Still blocked: blogspot, livejournal, wordpress (no surprise here -- lots of political blogs), BBC, certainly more; most importantly, Sinfest and CRFH :-( (why the f* is CRFH blocked? Zombies? Satan?)

Also, the web feels slightly faster in general!

The Ballad Of Sir Href

blog entry posted by lalo (Lalo Martins) on 2008-03-26 16:03:00


It goes like this...

Take a page from Arthur's book
Join this body august most
Take seat at the round table
Where no knight a row will brook
Remain strong but never boast
Proper form makes you able
The Elysium Fields to fill
And submit all to your will
With your style and your skill.

Rediscover the web. Again.

blog entry posted by lalo (Lalo Martins) on 2007-10-28 23:01:00


I'm absolutely in love with Prism. It's like being set free.

See, the trick here, the main trick, is not running sites in separate, stripped-down windows. Rather, it's that each of those is a separate process, and these processes don't step on each other's feet. They're very light on memory, and if they do have all the Firefox 2 memory leak issues, I'll never know, because I have no reason to keep them open for extended times -- since they load incredibly fast.

The funny thing here is -- many people list tabbed browsing as the major Firefox innovation, or at least one of the top few. What we usually don't think about is how much that became a cage. Since all the sites I'm browsing run in the same process, any memory leaks or crashes become a big problem. I don't know about everybody else, but at least myself, and a few other "power users" I know, tend to leave 10+ tabs open all the time. When some of those tabs are heavyweight web apps like GMail... everything can get very slow very fast.

When I installed Prism, the plan was to leave GMail open on Prism all the time, so if I want to restart the browser, it won't affect my mail. But see... I don't need to have GMail open all the time. I did, because it takes a while to load, and because it was a bit inconvenient to get some new mail notification that integrated well. (If GMail runs in a tab, how am I going to know if it's already open or not?) Now I don't have it open all the time anymore, and I have XFCE's mail notification plugin on my panel, which launches the GMail webapp on Prism. It's brilliant; the speed and flexibility of a desktop app.

Absolutely worth trying.
