A few years ago, I was kinda into XML. Sure, it’s bloated, but the idea that you could arbitrarily represent any kind of data in it seemed cool to me. And then – if you were to try and compose two types of data that no one had ever thought of before – you could support even that, with namespaces. You could even have two conflicting elements <foo> – by specifying which one is in which namespace – <a:foo> vs. <b:foo>. Neat. Now you can really have some nuance and power in your document. Mind you, I’ve never seen this feature used – and yet it bloats the XML specifications and implementations horribly – but it seems like it could be important. Right? Never mind the fact that just about any time you grab an XML document you probably already know exactly what it’s going to look like. Shh. You’re not thinking big enough. Here’s even an article I wrote in 2005, sad that Web Pundits were going to start moving away from XML. And here’s another one from early 2007, again complaining about the inevitable HTML5. I was totally and completely wrong. I mean, I’d like to say something like “while I still believe that blah, I have to admit that I may have been mistaken…” No. Totally. Dead. Fucking. Wrong. Maybe XML’s heart was in the right place (there! I did it! some sort of backpedaling statement!), but the devil’s in the details, and XML’s details have more devils than you can shake a stick at. Several sticks.
You see my friends (you can tell I watched the debate last night, right?), I just finished working like, maybe 10 hours straight on writing a SAML receiver in PHP for my former employer. That wouldn’t be so bad, except – I’d already written one. It worked fine. For SAML 1.0. Now I had to make it read SAML 1.1. Easy, right? Read the spec on SAML 1.1, implement the changes, all done. No. SAML assertions are XML documents. XML documents that need some kind of security thingee so that people can’t forge them or tweak them. So you need XML Digital Signatures. But XML is so crazy and fluid – you could have two documents that logically mean the same thing, but their bytes don’t match! How do you compare them? Easy, my friends! You canonicalize them using the XML Canonicalization spec(s), then you sign them. SAML 1.1 “improves” this process using a “better” method of canonicalization. If you read lots of sarcasm in my angry sarcasm-quotes, you read correctly. Back to canonicalization in a moment.
Now if we’re going to sign a document that’s XML, and since everything that has ever been of any merit at all is XML and must be XML, then our signature should be XML too. But if we’re injecting bits of XML into our document to sign it, doesn’t that change the document that we’re signing? We need some way to indicate which subset of the document corresponds to the signature, and which subset corresponds to what-you’re-signing. I know, I know! How about a nice simple regex to do that! Or just a straight subset of the document – cut from here…to here? Hahahahaha…just kidding! That’s not XML! No, we have to use XPath, a way to query for arbitrary “node-sets”. And it gets embedded, of course, in XML.
So this is the ridiculous technology stack I have to go through in order to implement this relatively simple request – “let us accept SAML assertions to do single sign-on stuff.” So of course PHP doesn’t support any of this crap – because this crap is crap. Only IBM and Sun and other Big Company Weenies implement this garbage. PHP’s a working-man’s language, it supports things that are useful or interesting. There’s some Sun-sponsored SAML 2.0 stuff in the works in PHP, but we need 1.1. PHP’s XML support has historically been spotty – and I don’t blame it, the XML-approved APIs are the worst APIs ever. Ever. Well, I think I had looked once at a PHP library for DNS that may have been worse. But still, very bad. So I had to cook a lot of this stuff up myself. It sucked. And the specs are, quite frankly, just wrong. Or so grossly unclear that they could never be right. And I’m no moron – I’m a big freakin’ super genius type, and I can’t implement whatever the hell they’re talking about. So there’s no chance for lesser programmers. And because people are abandoning it in droves, there are tons of half-implemented XML packages, and digital signature packages, and XML canonicalization packages sitting out there, in various states of disrepair and malfunction. All in different languages. I had to learn bits of Python and was on track to start trying to learn Java if I hadn’t gotten myself out of some serious holes.
Here are some fun notes. Here’s the default XPath (make sure to capitalize that P!) that should exclude the signature from what gets signed: <XPath xmlns:dsig="&dsig;"> count(ancestor-or-self::dsig:Signature | here()/ancestor::dsig:Signature[1]) > count(ancestor-or-self::dsig:Signature)</XPath>
Oh, whoops! Except that doesn’t work. That’s just what’s in the spec. No reason it should work. Let’s expand the dsig entity – <XPath xmlns:dsig="http://www.w3.org/2000/09/xmldsig#"> count(ancestor-or-self::dsig:Signature | here()/ancestor::dsig:Signature[1]) > count(ancestor-or-self::dsig:Signature)</XPath>
Uhm, nope. That “here()” function doesn’t actually exist, you see. So I gotta make my own. Fast-forward two hours or so – hell, probably more – and many, many iterations, to get: <XPath xmlns:ds="http://www.w3.org/2000/09/xmldsig#"> (//. | //@* | //namespace::*)[not(ancestor-or-self::ds:Signature)] </XPath>
Now, shit, that *was* pretty obvious – I don’t know how I missed it. Say, though – maybe it’s just me, but maybe we’re using XPath in a way that wasn’t intended? You can tell by the fact that we have to grab all attributes, namespaces and tags at the start, unioning them together, then…doing I don’t really know what to them to ensure…something about their ancestry. Horrible. Really, really horrible.
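For what it’s worth, the predicate in that last expression reads as: keep every element, attribute and namespace node that is not a ds:Signature and has no ds:Signature ancestor. Here is a rough Python sketch of the same effect. Loud assumptions: the standard library’s ElementTree has no ancestor-or-self axis, so this does not evaluate the XPath at all; it just prunes Signature subtrees from a copy of the tree. The Assertion/Subject sample and the strip_signatures helper are hypothetical names made up for illustration, not real SAML and not any library’s API.

```python
import copy
import xml.etree.ElementTree as ET

DSIG_NS = "http://www.w3.org/2000/09/xmldsig#"
SIGNATURE_TAG = "{" + DSIG_NS + "}Signature"

# Hypothetical signed document, only loosely shaped like an assertion.
SAMPLE = (
    '<Assertion xmlns:ds="' + DSIG_NS + '">'
    "<Subject>alice</Subject>"
    "<ds:Signature><ds:SignatureValue>abc</ds:SignatureValue></ds:Signature>"
    "</Assertion>"
)


def strip_signatures(root):
    """Return a deep copy of root with every ds:Signature subtree removed.

    Approximates [not(ancestor-or-self::ds:Signature)] on a tree instead of
    a node-set. A root element that is itself a Signature is not handled.
    """
    clone = copy.deepcopy(root)
    # Snapshot the traversal first, because children are removed while walking.
    for parent in list(clone.iter()):
        # Walk children right to left so earlier indexes stay valid.
        for index, child in reversed(list(enumerate(parent))):
            if child.tag != SIGNATURE_TAG:
                continue
            # Text after the signature is outside it, so hand it to a neighbour.
            if child.tail:
                if index == 0:
                    parent.text = (parent.text or "") + child.tail
                else:
                    previous = parent[index - 1]
                    previous.tail = (previous.tail or "") + child.tail
            parent.remove(child)
    return clone


if __name__ == "__main__":
    signed = ET.fromstring(SAMPLE)
    print(ET.tostring(strip_signatures(signed), encoding="unicode"))
```

Note that, like the XPath it imitates, this throws away every signature in the document, not just the one doing the signing, which is what the here() version in the spec was trying to be careful about.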
XML Canonicalization was the bane of my existence when I made the SAML 1.0 receiver, and it returned with a vengeance this time. The concern is that some XML processors may shove nodes around and do stuff to your document that doesn’t change its meaning, but changes its bytestream. So we want to be able to transform the document in such a way as to make it always look the same, no matter how mangled it gets. XML Canonicalization actually fails at this, in that you can compare two logically identical documents: <a:foo xmlns:a="http://www.foo.com"/> vs. <b:foo xmlns:b="http://www.foo.com"/> – they don’t compare identical, but should. Even after canonicalization. But! Heaven forbid we say “screw it, let’s just say don’t muck with the data, and call it a day!” No no, that’s not the XML way! Instead you have to do all kinds of stuff. Turn empty tags into tag pairs, reorder attributes in each node, expand entities, strip some stuff, etc. And with “Exclusive XML Canonicalization” – the new-and-improved XML Canonicalization method used in SAML 1.1 – it gets even more confusing when you talk about your subset of the document and the namespace nodes that go with it. And then the spec’s wrong. And it turns out your test SAML assertion is canonicalizing using the method you already built 6 months ago, but is just calling it something else.
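To make the canonicalization complaint concrete, here is a minimal Python sketch. One loud assumption: it uses the standard library’s xml.etree.ElementTree.canonicalize (Python 3.8+), which implements C14N 2.0, not the C14N 1.0 or Exclusive C14N that the SAML signatures discussed here call for. So treat it as an illustration of the idea, not as something to check a real assertion with. The doc/empty sample is made up; the foo pair is the one from the paragraph above.

```python
from xml.etree.ElementTree import canonicalize  # Python 3.8+, C14N 2.0

# Hypothetical documents: same meaning, different bytes
# (attribute order, quote style, empty-tag shorthand).
MESSY = "<doc b='2' a='1'><empty/></doc>"
TIDY = '<doc a="1" b="2"><empty></empty></doc>'

# The pair from the text: same namespace URI, different prefix.
PREFIX_A = '<a:foo xmlns:a="http://www.foo.com"/>'
PREFIX_B = '<b:foo xmlns:b="http://www.foo.com"/>'


def same_canonical_form(left: str, right: str) -> bool:
    """Compare two XML strings by their canonical serialization."""
    return canonicalize(left) == canonicalize(right)


if __name__ == "__main__":
    print(same_canonical_form(MESSY, TIDY))
    print(same_canonical_form(PREFIX_A, PREFIX_B))
```

The first comparison should come out equal, since attribute order, quoting and the empty-tag shorthand all get normalized. The second should not, because with default options the prefixes are kept exactly as written.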
Sometimes the comedy of errors around all of this stuff makes me think that someone or something deliberately torpedoed it all. Perhaps Microsoft was concerned about some kind of interoperability utopia coming about, and they sent their agents to agitate for namespaces and XPath and XML signatures and enveloping and so on. Who knows.
If you ever find yourself in this unenviable position, first off, get xmlstarlet. If you don’t, you’ll never have anything to compare your own work to. I only got it late in the process, and most of my real progress came after I got it. It requires libxml2 and libxslt. They’re handy to have around, though you may already have them. Once you’ve got that, read the specs very fuzzily. They’re not quite right, and Real Life trumps specs anytime. The end result is that it was not fun, at all. Very fulfilling in the end, when I finally saw the message that the assertion’s digest and signature were ok, but not at all fun. And not code you want to show your mom. I don’t imagine myself working with this awful crap again for quite a while – or so I hope.
It’s funny (and it’s 2am, and I’ve drunk some Pepsi MAX, so I’m a little wired, so please indulge me) that you can see that there are any number of New Hip Cool technologies that start getting pushed really hard by companies, and end up being useful for some things, but not the panacea that they’re supposed to be. And you know how you can tell which technologies will end up being snake-oil? Look for the ones that claim they’ll end up powering a refrigerator that can automatically order milk when you’re running low. They’ve been saying that shit since “HTTP Push technology” was the exciting hip technology that was going to change the world. Let’s see, I’m sure I’m missing some, but the ‘hip technology that isn’t actually good’ list that I can remember would be…remote procedure calls…object oriented programming…then remote method invocation…client-server…Web 0.9 (everyone needs a single, static web page! Hosted on http://www.whateverhost.com/~companyname)…Push Technology…Web 1.0 is around there, oh I know B2C…B2B…Java…XML…Web 2.0…Y’know what it is now? Virtualization. It’s got its uses, sure. But having one big box and virtualizing a whole bunch of little boxes in it means you still have to manage a whole bunch of little boxes – they just live in a big one. Actual consolidation is better – moving a whole bunch of related functions onto one big box. The idea that you can move around the images is definitely neat, and over time, we always reduce our attachment to the bare metal of our computers – virtual memory, virtual volumes (logical volumes in Windows and Linux), why not virtualize the machine too? I just don’t see it as a cure for all ailments, and it does increase single points of failure (unless you do it right, but most don’t). Okay, now I’m getting legitimately tired, I’m going to bed.