See: this post from a long time ago.
Now, let me state up front: I like Mark Pilgrim, and I think he’s great. I sometimes disagree with his tone, and I think he can be an asshole, but for the most part I agree with him technically. But not this time.
This time he’s saying that, because of a Very Shitty RFC, (almost all) XML on the web is broken.
I think he’s wrong, and is panicking about a Very Shitty RFC (3023, if you care).
The fundamental issue is transcoding proxies. They take anything served as:
Content-Type: text/anything; charset="httpcharset"
<something><something charset="contentcharset">...
And transparently transform the text into a different charset, assuming it was httpcharset in the first place!
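Here is a rough sketch of what I mean, in Python. The `transcode` function is my own hypothetical stand-in for the proxy, not anything from a real product; the only assumption is that it trusts the HTTP charset and never looks at the declaration inside the document.

```python
def transcode(body: bytes, http_charset: str, target_charset: str) -> bytes:
    """Hypothetical proxy step: trust the HTTP header, ignore the document."""
    text = body.decode(http_charset, errors="replace")
    return text.encode(target_charset, errors="replace")


# The document really is UTF-8, exactly as its own declaration says...
original = '<?xml version="1.0" encoding="utf-8"?><p>café</p>'.encode("utf-8")

# ...but it was served with charset=iso-8859-1, and the proxy "helpfully"
# converts what it believes is Latin-1 into UTF-8.
proxied = transcode(original, "iso-8859-1", "utf-8")

# The client follows the internal declaration and reads the result as UTF-8.
print(proxied.decode("utf-8"))
# <?xml version="1.0" encoding="utf-8"?><p>cafÃ©</p>
```

The two bytes of the UTF-8 "é" get read as two separate Latin-1 characters and then re-encoded, so the markup survives but the text does not.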
In case you cannot tell, this means that HTML will get mangled too. Plenty of HTML is served with one charset, but specifies another one internally. Well, that’s an easy statement to make, but I think it’s true?
Anyway, any time your document contains a character outside the charset it was served with, a transcoding proxy will mangle it beyond recognition.
This is not limited to XML. HTML, hell, even text/plain documents will all be horrifically mangled. Anything where the charset declared in the HTTP header covers fewer characters than the charset the content actually uses will be messed up. That’s a lot of content to mangle for a proxy’s sake, especially now that most clients are charset-aware. Weird tricks to keep non-charset-aware applications working seem to break everything else…
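To make the "narrower charset" case concrete, here is a small Python sketch. `proxy_decode` is a hypothetical helper of mine standing in for the proxy's first step, and I am assuming it substitutes a replacement character for bytes it cannot decode rather than rejecting the document.

```python
def proxy_decode(body: bytes, http_charset: str) -> str:
    """Hypothetical proxy step: decode using only the HTTP charset."""
    return body.decode(http_charset, errors="replace")


# A text/plain body that is really UTF-8 but was served as charset=us-ascii.
body = "naïve".encode("utf-8")

seen = proxy_decode(body, "ascii")

# Each byte of the two-byte "ï" falls outside ASCII and becomes U+FFFD.
print(seen.count("\ufffd"))  # 2

# Whatever the proxy emits next, the original bytes are gone for good.
print(seen.encode("utf-8") == body)  # False
```

Once the replacement characters are in, nothing downstream can tell what the original character was.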
But that being said: soon after he published this article, Mark Pilgrim stopped posting to his blog. Why? I liked disagreeing with your tone, often. I liked thinking, “What an asshole, but he’s right.” I even snickered at some things. It was Good. So my RSS (Atom, Mark, it speaks Atom! Don’t freak out!) feed reader still points to his nonexistent feed as a silent protest.
Come back Mark!