Windows 8 from the point of view of a Mac user

So I should mention before anything else that I use Windows 8 just for fun. I work all week on my Retina MacBook Pro in OS X, and on evenings and weekends – when I want to play World of Tanks – I reboot into Windows. That’s about it. Make money on OS X, have fun in Windows. It’s kinda like the opposite of how it was in the 90’s (working in Windows, coming home to Mac).

So today when I launched World of Tanks, I realized I was in a bit of a rut, tank-wise – I still needed quite a bit of XP to get to the next set of tanks. What does that mean? Windows 8 upgrade time!

I was able to do the whole thing online. $200. I would’ve liked to step down from my Win 7 Pro to Win 8 Home – I mean, all I do is game on the thing. Why do I need Pro? But it didn’t give me the chance. The Upgrade Assistant was pretty reasonable. It complained at me for not having enough space – I uninstalled some Steam games to make room, and when I flipped back to the Assistant window, it had already moved along to the next screen. Not bad.

As my friend Beckley and I have been discussing – Microsoft is throwing money away by not making it easier for Mac people to get Windows. And by making it way too expensive. If they put up an app in the App Store (presumably with Apple’s buy-in) they could put a $99 price tag on it and make some nice money. At higher margins than OEM licenses! But they’re dumb and short-sighted about things like that. Oh well.

Anyways, it still takes quite a few reboots for the install to complete. That was unexpected – but I guess that’s me not remembering my Windows-fu. I kept having to hold down the Option (alt) key to get the Mac to boot into Windows. The installer threw me into a setup assistant, and I somehow managed to inadvertently hook up my Xbox account to my Windows login. Freaky, but why not.

I ended up hooking my Metro Home screen up to my Facebook account too, as well as my Xbox account. I was surprised to see Xbox avatars for all of my friends pop up – relatively easily accessible from the home screen as I clicked around. I had to download some updates, and the process wasn’t as buttery-smooth as it could have been, but still not bad at all. My FB friends are around too. Again, not too terrible. I hooked two of my three Gmail accounts into the Mail app. In the end, I had a brand new view of all kinds of ‘live tiles’ giving me a newfangled view of my PC.

The Start button in Windows has always been a disaster for me – it contained a billion things I didn’t care about, and three that I did. When it had a scrolling view it was even worse. And that one view where it ‘automatically’ optimized where everything went? Even more of a disaster. So in Windows 8, Metro completely replaces the Start Menu. This new one actually makes sense to me. The three things I want to find? Right there, staring me in the face. I want to move them around? Easy and obvious. The one place it consistently throws me for a loop is when I want to find something that I would normally have to dig through the lesser-used folders in the Start menu to find (Start -> Program Files -> SomeStupidCompanyName -> DumbUtility…). That’s where my muscle-memory gets in the way. I can’t find something, I hit the ‘command’ key (Windows key) and the Metro screen comes up -> I can’t find what I’m looking for -> I click on Desktop -> I still can’t find what I’m looking for -> hit the Windows key again…

The ‘flat’ look seems timeless to me. Reminiscent of Star Trek: TNG’s LCARS pseudo-OS. No excessive curves and embossing and rounded edges and so on. Just really clean and flat. It starts to fall down a little bit when you look at legacy Windows things – they look a little odd, trying to be flat like Metro but inheriting all the Windows baggage. And lots of things that are buttons don’t look buttoney. So I can imagine I’ll find myself constantly scrubbing my mouse around to see if things are clickable. That will certainly cause some level of UI-fail. It’ll be even worse on a touchscreen machine.

I’m even using IE10 (!) because it shows up better on my Retina screen. I tried Google Chrome in ‘Windows 8 mode’ and it was OK – but the text showed up too small. IE actually is not bad. I’m completely shocked by this. The font rendering still looks a touch off to me – some parts of letters look too thin, maybe? I can’t put my finger on it. But I can read the screen, and that’s a nice start.

In the end, my feelings about the look and feel of the thing? I don’t see why everyone is up in arms. It’s actually kinda futuristic-looking, IMHO. I find it very pleasant. Of course, I’m a bit of a contrarian – when I first used Windows Vista it didn’t bother me as much as it seemed to bother everyone else. And when I first used Windows 7 I didn’t think it was so super amazingly awesome like everyone else did. So take that into account. And also take into account that the only things I do in Windows are: play games, test things in IE, and poke around. I don’t usually try too hard to get much work done.

But as always, the devil is in the details. And that’s where Apple tends to really knock things out of the park, and where Microsoft tends to fumble. Especially when I click through to something that’s “classic” Windows from something that’s Metro, things feel janky and weird and awkward. The tile layouts in the ‘top free’ and ‘top paid’ sections of the App store are awful and useless. I can’t search the store either; if the app you want isn’t one of the ‘suggested’ ones, you’re screwed. The Skype integration seemed exciting – but then it insisted on doing some kind of account merge that I didn’t want it to do. I still can’t figure out how to add a third email account. I can’t get the number of unread messages to show up in the email tiles. It’s hard to get apps to show up in the Metro panel thingee if they aren’t there already. Or if you (ahem) accidentally ‘unpin’ one (oops!). My Hipchat (Adobe Air) app was all messed up and confused, and it took quite a bit of cajoling to get it so I could have a normal window on my screen. (That could be Adobe’s fault, or HipChat’s fault, or Microsoft’s fault. Not sure.) As I continue to play with it, I’m sure I’ll find more to complain about. I couldn’t figure out how to open multiple windows in the Metro version of IE10 until I was doing final edits of this article (right-clicking somewhere plain on the page seems to bring up a contextual menu?)

But in the end, I think Microsoft is trying some really clever stuff here. I think this Metro stuff really is the future. Apple broke with OS X precedent when it made iOS; and I think Microsoft is trying to do the same thing here with Win8/Metro. And in the same way people were up in arms and completely freaked out when Apple first removed the floppy drive – and then later, the CD-ROM – I think that’s how the typical Microsoft person is responding to Metro.

I think Microsoft has latched on to Apple’s “Halo Effect” strategy. Apple had the iPod, and the iPod “Halo” started to cause a boost in Mac sales (and led to the iPhone and iPad). Analogously, Microsoft has the Xbox. Perhaps, from this Xbox Halo, they can start to rebuild the strength of their Windows empire? Maybe. But remember – Microsoft is adopting Apple’s strategy here. And Microsoft used to be very good at taking someone else’s idea or product, imitating it, iterating a few versions of it, throwing in some dubious business practices, and then coming up with something that actually starts to crush the competition. They might be trying that again here. Though they haven’t really pulled it off in a while, I think.

My bet is that Microsoft-users will continue to whine and bitch and moan about how terrible and awful Windows 8 is. And we’ll have a service pack or two come out, and then maybe some kind of interim release – and then maybe people will get used to it and move along.

I think, most importantly, that if they hadn’t obsoleted themselves with Metro, someone else – probably Apple – would’ve done it for them. So they really had no choice.

In half-hearted defense of PHP

Okay, now there’s a big rant about PHP going around. These things are getting exhausting. I don’t particularly like Python, but that doesn’t mean I have to write some long, boring post about it. It’s just a matter of my personal preference, and I know that. Plenty of people do plenty of nice things in Python, and that’s great. Twisted seems like a cool idea. But I don’t like Python. And no one cares, and no one should care.

So the rant on PHP has a similar tenor to the ones about Node.js – which already starts us off on the wrong foot. However, as I read through the whole thing, I find that I can’t disagree with any one particular point. Every single one seems at least conditionally valid, and none of them seem deliberately designed to be maliciously false. So, points to you, sir and/or ma’am, on your post. My rebuttal follows.

I’ve been using PHP for somewhere around 10 or 15 years, and I haven’t run into even half of those various weird edge-cases. I’m not saying they don’t exist – they seem like they do. I’m saying you don’t run into them all that often.

However, I do feel that the long, exhaustive list of ‘things that are wrong with PHP’ sort-of misses the point. Yes, the language has weird bugaboos about it. All languages do. No, none of them are fatal. The language is easy to pick up and be productive in. It’s easy to have a mostly-static site with the odd dynamic page here and there. Deployment is a breeze: FTP up your new files and you’re done. You don’t even have to think about it. Sites run fast. Development is easy. The documentation is excellent – I haven’t seen any other environment that comes remotely close in that regard. Just about anything you would want to do, there is a function for. It exists to make dynamic web pages, and it fits that job well.

The article also rather quickly breezes over an important point: Facebook uses PHP. Why is that? The article states that it’s fine for FB because they’re huge and can ‘engineer around’ the various weaknesses of the language. That certainly raises the question of why they would bother. Is it just legacy? Are they just dumb? I have a hard time believing either of those possibilities.

Mass hosting for websites really only works with PHP. The security gets a little dodgy, but it basically kinda works. Try that with any other webdev environment and you won’t get nearly the server-density that you do with PHP. That is pretty hard to beat.

Server-side includes. Anyone remember these? This is really what PHP has supplanted. For that kind of stuff, it’s great.

Or for a one-page site. Or a static site that has one dynamic page that does something – you cannot beat PHP.

Plenty of people use it with frameworks for complex sites. If I had a really super complex site to do, I honestly don’t know whether or not I’d do it in PHP. But if I did decide to do it there, I wouldn’t feel the least bit bad about it.

Edit, reload, etc. You can edit a PHP page on a server, hit reload, and it works. No magic. Or if there is magic in that, it’s so magical that you don’t need to ever know how it works. This is quite pleasant. No dicking around with production versus development or HUP signals or restart.txt files or any of that shit. On development or production; edit your file, reload. Boom.

Server infrastructure – since PHP is most often served up via Apache, you don’t have to throw some kind of reverse proxy in front of it or anything. If you have static assets, they’ll be served up quickly via Apache’s static file handling. All for free in the default PHP setup.

Performance. PHP is fast. Even without an accelerator. I’ve built many, many PHP apps, and I’ve never needed to use one – the database always ends up being the bottleneck, never the web front-end. It will likely outperform most other web environments with no tuning. I *have* had to tweak the maximum number of httpd processes in Apache (MaxClients) on a very high-traffic site, and I’ve also raised the maximum memory PHP permits per-process (memory_limit), but, off the top of my head, those are the only two knobs I’ve had to twiddle. And I’ve only had to do that on sites with the highest-of-traffic running on the crappiest-of-server environments.

So if you want to live in your tiny ivory tower and yammer on endlessly about pedantic points like object hierarchies and namespace pollution and function-name consistency, feel free. Those of us who are jamming out PHP sites won’t be doing that – because we’ll have already finished the project we were working on and moved on to the next one.

Practical experience with Mongo, and why I do not like it, in terms of Money and Time

For my job, I inherited a Mongo architecture. I resolved to learn it – and it still runs to this day, ticking along quite nicely.

These are my feelings about the platform having actually used it in production – not on little toy projects. Our main MongoDB server is a 67 GB RAM AWS instance with several hundred GB of EBS storage.

First, the good parts:

It’s super-duper easy to set up and administer. Pleasant to do so, in fact.

Javascript in the console is a remarkably useful thing to have handy. I’ve used it for quick proof-of-concepts and tests and whatnot – really good to have.

It’s really nice to develop against. Not having to deal with schema changes, and being able to save arbitrary column-value associations makes life really easy.

And now, the bad (from unimportant to important):

Doing anything beyond a really trivial or simplistic query in the console is surprisingly annoying:

db.tablename.find({"name": {$gt: "fred"}})

Not a dealbreaker or anything, just annoying. name > "fred" would be nicer.

The default ‘mode’ of Mongo is surprisingly unsafe. I found drivers (could be the drivers’ fault, might not be Mongo’s) that return _immediately_ after you attempt a write – without even making sure the data has touched the disk. This makes me uneasy. There are probably use cases for this kind of speed at the expense of safety. But I don’t like it. And this is opinion, so I’m saying it. There _are_ modes that you can use (and I do use) that slow down the writes to make sure they’ve at least been accepted. But, in the conventional mode, if we have a crash before the writes have been accepted by Mongo, the data is gone. This has happened to us. Usually we have failsafes in place to ensure that the data eventually gets written, but it costs us time. We’re mortal; we’re going to use the defaults we get until they don’t work for us, then we’re going to change them.
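For illustration, here’s what asking for acknowledgement looked like from that era’s JS console – the drivers expose the same thing as write-concern options (the collection name is made up):

db.tablename.insert({ name: "fred" });        // default: fire-and-forget -- returns immediately
db.runCommand({ getLastError: 1 });           // block until the server has at least accepted the write
db.runCommand({ getLastError: 1, j: true });  // stronger: also wait for the journal to hit disk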

Because there is no schema, a mongo “collection” (fuck it, I’m going to call it a table) takes up more storage than it should – every document stores its own copies of the field names, repeated in each and every record. Our large DB would be far smaller if we defined a schema and stored it in an RDBMS. This space seems to be taken up in RAM as well as disk. This costs us more money.
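A contrived console illustration (collection and field names are made up):

// every document carries its own copies of the key names...
db.readings.insert({ temperature_fahrenheit: 72, humidity_percent: 40 });
db.readings.insert({ temperature_fahrenheit: 68, humidity_percent: 45 });

// ...so terse keys store the same data in noticeably less space --
// compare db.readings.stats().size to db.readings_terse.stats().size
db.readings_terse.insert({ t: 72, h: 40 });
db.readings_terse.insert({ t: 68, h: 45 });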

MongoDB starts to perform horribly when it runs out of memory. This is by design – it leans on memory-mapped files and the OS page cache. But it’s annoying, and it costs us more money and time, because we have to either archive out old data (which we do in some cases) or use a larger instance than we ought to have to (which we also do). And even if you delete entries, or even truncate a table, the amount of space used on disk remains the same (see below). More money.

MongoDB will fragment the storage it’s given if your data gets updated and changes size. This caused us to end up storing around 20-30 GB of data in a 60-some-odd GB instance. Then we started to exhaust RAM, and performance plummeted. So we needed to defrag it. More care and feeding (time) that I didn’t want to spend.

So to ‘fix’ the fragmented storage, we had to ‘repair’ the DB. This knocked our instance offline for hours. Many hours. Time. The standard fix for this is to spin up another instance (money), make it part of a replica set, repair one, let it catch up, then repair the other. Time.
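From the console, the diagnosis-and-‘repair’ dance looks roughly like this (a sketch – both helpers exist in the standard mongo shell):

db.stats();           // compare dataSize vs. storageSize/fileSize to see the fragmentation
db.repairDatabase();  // rewrites the data files compactly -- and blocks the instance for hours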

The final issue I had with Mongo was when I attempted to shard our setup. We were using a 67GB (quad-x-large memory) instance for our Mongo setup. I got advice from some savvy Mongo users to ‘just shard it.’ That made it sound so trivial. So I did. I figured we could go for 16GB shards, add a new one when we got heavy, and yank one out if we got light. I liked the idea of being able to save more money and flexibly respond to requirements. So I set up a set of four shards – three “shardmasters” which coordinated metadata, and one ‘dumb shard’ which just stored data and synced up to the metadata servers. I blew the config the first time. Whoops. I did it again, and this time, I did it right. I picked a shard key – not an ideal one, but one that would, over time, roughly evenly distribute data across all of our shards, while maintaining some level of locality for the most likely operations. I ran a test – it’s really nice to do in the JS console, I must say. I ran a for-loop to insert 10 million objects of garbage data, with a modulo-10 of ‘i’ to simulate the distribution of the shard keys. I watched, in a separate console, as it threw data on one shard, then started migrating data from that shard to the others. It worked enormously well. So I yanked my test data, then we put production data on the thing.
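For reference, the test harness was nothing fancy – roughly this, from the JS console (collection name and payload are made up):

// simulate the shard-key distribution with i modulo 10
for (var i = 0; i < 10000000; i++) {
  db.stuff.insert({ shard_key: i % 10, payload: "garbage test data" });
}
db.printShardingStatus();   // run in the separate console -- watch the chunks migrate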

It worked fine for a few days. Then Mongo filled up one shard and blew up. It was a pretty huge, horrible catastrophe. It was hard for me to troubleshoot what, exactly, happened – but it looked like no data ever went onto any shard other than the first.

Now, I *was* using an ObjectId() as the shard key. Not the object’s _id, but the object_id of a related table. One that was nice and chunky – it didn’t change very much, except every few hundred thousand records or so. It’s possible that I shouldn’t have used an ever-increasing ObjectId as the shard key – with a monotonically increasing key, every new document lands in the ‘last’ chunk, which lives on one shard. It’s possible that switching from an integer going from 0-9 over to an ObjectId that increments somehow messed me up. I tried to figure out what happened, after the fact, and got similarly nowhere. I also checked the documentation to see if I had done something wrong. While I wasn’t able to find anything definitive, there was mention of an ObjectId shard key possibly throwing all traffic onto just one shard. For our purposes, that would’ve been fine *if* the other ‘chunks’ of data got migrated off that shard onto somewhere else. They didn’t. This whole ordeal cost us loads and loads of time. Again, I’m perfectly willing to take the blame for having done something wrong – but if so, then there’s something missing in the documentation.

So that was a complete nightmare. But it’s still not a technology I would discount – I can imagine specific use-cases where it might fit nicely. And it sure is pleasant to work with, as an admin and as a developer. But I’d far rather use something like DynamoDB (which seems very interesting), or something like Cassandra (which I’ve been following closely, but I have not yet put into production). In the meantime, I still use a lot of MySQL. And it definitely shows its age, and isn’t always pleasant, but generally does not surprise me.

Amazon cloud-init – customizing EBS-backed Amazon Linux AMI’s

EDIT – No, not even this works. I feel like I’m losing my mind.

EDIT 2 – Oh, apparently you *have* to specify the boot kernel. Have to. Can’t use “use default” as I have been for, like, ever. Ugh. Angry.

I just blew a horrible amount of time on this. I’ve burned many an AMI – based on both ephemeral-store and EBS-backed volumes. But trying to do it ‘right’ – with programmable private keys and whatnot – seemed to be out of my grasp, at least when using Amazon’s own Linux distro.

If you try to customize Amazon Linux, you’ll find that some things that are normally done by cloud-init don’t seem to work on your image – namely, setting SSH keys. It works fine when you first boot the pristine Amazon image, but when you try to burn your own, it won’t set the SSH keys properly.

To fix that, before you burn your image, blow out the contents of /var/lib/cloud/, and empty both /root/.ssh/authorized_keys and /home/ec2-user/.ssh/authorized_keys. They’ll get reset on next boot.

This isn’t documented anywhere, and I basically had to dick around with strace and flip through all of the Python code to figure out that there’s a semaphore file in /var/lib/cloud/sem that gets set, and then the ssh-key-setting script will never run again at boot. It makes me angry – but maybe that’s Amazon’s point; they don’t want you to customize their image, so they can save on EBS volume space. I don’t know. Pisses me off and wastes my time, for sure, though.

You would think that, at least when I tried to run the stuff by hand, it would say “Oh, hey, there’s a semaphore file right here – make sure to yank it if you really want to run your scripts again.” Not this silent no-message bullshit.

ARGH.

-B.

Follow up on Amazon Elastic Load Balancers and multi-AZ configuration

I got a really good comment on my blog a day or so ago from a guy by the name of Mark Rose (that’s the only link I have for him, sorry!) He mentioned that AWS multi-AZ load-balancing happens via DNS – which intrigued me – so I thought I’d mess with my test load balancer and see.

He explained that each AZ gets its own DNS entry when you look up the load balancer – and that meshes exactly with what I’m getting. I do the DNS lookup for the LB and get two IP addresses right now – and I’m assuming that each one corresponds to one of the AZs the LB spans.

But Amazon does some interesting DNS stuff – for instance, if you look up one of your ‘public DNS names’ of your instances from the _outside_, you get the instance’s outside IP address. But if you look it up from the _inside_, you get the inside IP. I use this for configuration settings, when I want an instance to have a relatively-static internal IP. Instead of trying to pin down the IP, I set up an elastic IP for the instance, and use the permanent public DNS name for that IP as the _internal_ hostname for the instance. This way, if the instance reboots, I just have to make sure that the elastic IP address I’ve configured is still associated with it, and everything still works internally.

I assume that traffic to the inside IP address is faster than bouncing back outside to the public address, then going back inside. I definitely know that it is cheaper – you don’t pay for internal network traffic, only external.

So my question is – what does it look like when you try to resolve the load balancer’s DNS name from the _inside_ of Amazon AWS? Do you get the same outside IP addresses, or do you get internal ones instead? Since it seemed like AWS traffic ‘tends’ to be directed back to the same AZ it originated from, I expect to get different answers.

So here’s what I did. I set up an ELB across two AZs – us-east-1a and us-east-1e – with an instance in each, then installed and launched Apache on both. As soon as the ELB registered the instances as ‘up’, I did a DNS lookup from the outside to see what it resolved to.
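The lookup itself is trivial – in Node terms it’s just this (the ELB hostname is a placeholder):

var dns = require('dns');

dns.resolve4('my-test-elb-1234567890.us-east-1.elb.amazonaws.com', function (err, addresses) {
  if (err) throw err;
  console.log(addresses);   // the A records the ELB name resolves to
});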

I got exactly two addresses – I’m assuming one points to one AZ, and one to the other.

Then, I tried to resolve the same ELB DNS name from the _inside_. Weirdly enough, I *still* get both (outside) IP addresses! I didn’t expect that.

So now, I wonder, is there anything to ‘bias’ the traffic to one AZ or another? Or is it just the vagaries of DNS round-robin that have been affecting me?

I changed the home pages on both Apaches to report which AZ they’re in. I then browsed to, and curl’ed, the ELB name. The results were surprisingly ‘sticky’ – in the browser, I kept seeming to hit 1-a; with curl, I seemed to keep hitting 1-e.

What if I specifically direct my connections to one IP or another? Let’s see.

Since the ELB IP addresses seem to correspond, one-to-one, with AZs, I thought I would be able to curl each one. I did, and consistently got the same AZ for each IP. One seems to be strongly associated with 1-a, and one with 1-e.
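In Node terms, the per-IP test looks something like this (IPs and hostname are placeholders – the trick is connecting to a specific IP while sending the ELB’s name in the Host header):

var http = require('http');

['203.0.113.10', '203.0.113.20'].forEach(function (ip) {
  var options = {
    host: ip,    // connect to this specific ELB IP...
    path: '/',
    headers: { 'Host': 'my-test-elb-1234567890.us-east-1.elb.amazonaws.com' }   // ...but ask for the ELB by name
  };
  http.get(options, function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () { console.log(ip + ' -> ' + body.trim()); });   // prints which AZ answered
  });
});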

So it seems the coarseness of the multi-AZ ELB load-balancing can be fully explained by the coarseness of using round-robin DNS to implement it.

Something else to note – it seems like the DNS entries *only* have 60-second lifetimes. With well-behaved DNS clients (of which I will bet there are depressingly few), you should at *least* be able to change the AZ you’re banging into every 60 seconds. However, in my testing – brief though it may be – it stayed pretty ‘sticky’.

So what does this mean? I dunno – I feel like I want to do multi-AZ setups in AWS even less now. Round-robin DNS is old-school, but at large enough scales it does generally work. Though I wonder whether heavily-hit web-services APIs, like the ones my company provides, fit well enough into that framework? I’m not sure.

Session stickiness and multi-AZ setups

Another question – how does this affect ‘stickiness’? You can set an LB to provide sticky session support – but with this IP address shenaniganry, how can that possibly work?

Well, weirdly enough – it actually does.

I set an Amazon load-balancer-provided stickiness policy on my test LB. I curl’ed the name and got the cookie. I then curl’ed the individual IP addresses for the load balancer, with that cookie set. And now, no matter which IP I hit, I keep getting results from the same back-end server. So session-stickiness *does* break through the load-balancer’s IP-address-to-AZ associations to always keep hitting the same back-end server.

I wonder what the AWS-provided cookie actually looks like. It seems like hex, so let me see if I can decipher it.

Since I don’t know if anything scary is encoded therein, I won’t post my cookie here, but when I tried to decipher it, I just got a bunch of binary gobbledygook. It stayed consistent from request to request (maybe modulo time, not sure), so it probably just encodes an AZ and/or an instance ID (and maybe a timestamp).
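If you want to peek at yours, the decoding is a one-liner in Node (the cookie value here is a fake placeholder):

var cookie = 'deadbeef0123456789abcdef';   // paste the value of your AWSELB cookie here
console.log(new Buffer(cookie, 'hex'));    // dumps the raw bytes -- binary gobbledygook, as noted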

Implications

So, since AWS exposes some of the internal implementation details of your load-balancer setups, what does this mean? It implies that you can lower the bar for DoS’ing an ELB-hosted website by just picking one of the ELB IPs and slamming it. In a two-AZ setup, instead of having to generate 2x traffic to overwhelm the site, you can pick one IP, hit it with 1x, and take the site half-down.

Considering the issues I’ve actually run into from having autoscaling groups that won’t scale because only one AZ is overwhelmed, I wonder if it makes sense to only have autoscaling groups that span a single AZ?

And it also seems to imply that you can DoS an individual server by hitting it with a session cookie that pins every request to that same back-end server. So perhaps, for high-performance environments, it makes sense to stick with shared-nothing architectures and *not* do any kind of session-based stickiness?

RightScale-to-Native Amazon Web Services (AWS) Name Synchronizer

At my company, we use RightScale for a lot of our Amazon Web Services management. It’s a pretty neat service – sort of “training wheels” for the cloud. Still provides us a lot of value.

But sometimes I like to log directly into the AWS console. Especially to find out when Amazon has scheduled reboots of our servers. Before I wrote this script, I would log in to find a whole bunch of instances running with no names. Then I’d have to go look them up in RightScale. Why can’t RightScale just name your Amazon instances with the right names?!

Well, I finally took matters into my own hands and built the following script. It walks through all of your RightScale servers, finds the associated Amazon instances, and sets their name attributes to the RightScale “nicknames.”
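There’s no magic in the AWS half of that, by the way – the ‘name’ you see in the console is just a tag whose key is ‘Name’, so the whole thing boils down to one CreateTags call per instance. Here’s a sketch of that call using the Node aws-sdk purely for illustration (the real script is linked below; the instance ID and nickname here are placeholders):

var AWS = require('aws-sdk');
var ec2 = new AWS.EC2({ region: 'us-east-1' });

// the console's "Name" column is just a tag with the key 'Name'
ec2.createTags({
  Resources: ['i-0123456789abcdef0'],                        // the instance ID found via RightScale
  Tags: [{ Key: 'Name', Value: 'my rightscale nickname' }]   // the RightScale nickname
}, function (err) {
  if (err) console.error(err);
});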

And I got permission from my job to make it available to the public – so here it is:

https://github.com/uberbrady/RightScaleNameSynchronizer

Yes, it is not the prettiest code I have ever written, but it does the trick. If someone wants to make it prettier I am definitely open to pull requests.

One thing I have noticed is that when you ‘relaunch’ a RightScale instance, the new instance will come up without an AWS name. Re-running the script will fix that. Also, if you use any RightScale arrays, the same thing can happen during scale-up/scale-down events.

Amazon Elastic Load Balancer (ELB) performance characteristics

So at my new job, I get to use AWS stuff a lot. We have many, many servers, usually sitting behind a load-balancer of some sort. Amazon’s documentation on these things isn’t very clear, so I’ve been trying to figure out what the damned things are actually doing.

First off – a good thing. These load balancers are really easy to use. Adding a new instance is a few clicks away using the Console.

Another good thing – you can use multiple availability zones as a way to avoid trouble when an entire Availability Zone goes down (as happened less than a year ago). And here’s where it gets ugly.

It seems to split traffic evenly across zones – even if you don’t have a balanced number of instances in each zone. And it seems to determine which zone to hit based on some hash of the source address. So, if you have 2 instances in an autoscale group, in two availability zones, and you hit your array really hard from one IP – bad things ensue. The one AZ you’re hitting maxes out its CPU, and the AZ you’re not hitting sits nearly 100% idle. That averages out to 50% utilization – not enough to trigger a scaling event (with my thresholds, anyway).

So, in short, if you have fojillions of people from all over hitting your services indiscriminately, I’m sure it’ll be fine. But if, like me, you have 1000’s of people (or so) from all over, some hitting really hard from one particular IP – it may not be a good idea to spin up more than one AZ. And spinning up 4 AZ’s seems silly – you’ll have more of a chance of at least one of them going bad, and until your load-balancer figures that out, one out of ‘n’ requests will be failing.

Another thing I’ve noticed – the ELB’s seem to have a strict timeout on accessing the back-end service. If it doesn’t get a response within 30 seconds, it will drop the connection and hit the service again. I had a service that was nearly getting DoS’ed by the load balancer that way. Make sure you have sane timeouts.

So the next thing I was curious about was whether or not the ELB would do any batching-together of the returned data – would it act like some kind of reverse proxy? I wrote a tiny Node server which spat out a little data, waited 10 seconds, spat out some more, waited another 10 seconds, then finished. Here it is:


#!/usr/local/bin/node
"use strict";

var http=require('http');

http.createServer(function (req,res) {
  console.warn("CONNECTION RECEIVED - waiting 10 seconds: "+req.url);
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.write("initial output...\n");
  setTimeout(function () {
    console.warn("first stuff written for "+req.url);
    res.write("Here is some stuff before an additional delay\n");
    setTimeout(function () {
      console.warn("second stuff written for "+req.url);
      res.end('Hello World\n');
    },10000);
  },10000);
}).listen(80);

So – I couldn’t tell any difference between hitting the server directly and hitting the load balancer (telnetting straight to port 80). It acted the same way. There was definitely no batching.

What about if I have a slow client, or something that’s trying to upload something – will it get batched together on the POST side? I modified my code to do a ‘slow POST’ – and it worked similarly. I couldn’t actually tell the difference between running that against the load-balancer and running it against the instance directly.

I also wrote code to generate a large (1MB) response dynamically on the server side, then wrote a client that would receive a little, pause the stream, resume it after a few seconds, pause it again, and so on. The one thing I noticed different between accessing the server directly versus the load balancer was that the server tended to give me data in 1024-byte chunks, whereas the load-balancer was giving me blocks closer to 1500 bytes. Weird! Well, maybe not – I *do* know for sure that the LB re-terminates the connection at the TCP level, since the source IP address changes. I was writing the data in blocks of 1k, so maybe each write turned into exactly one packet of 1024 bytes? But the LB, when re-streaming my TCP data, was sending larger segments – presumably filling packets out closer to the typical 1500-byte Ethernet MTU. Or so it would seem.
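For reference, the slow-reading client was along these lines (reconstructed; the hostname and path are placeholders):

var http = require('http');

http.get({ host: 'my-test-elb-1234567890.us-east-1.elb.amazonaws.com', path: '/big' }, function (res) {
  res.on('data', function (chunk) {
    console.log('got ' + chunk.length + ' bytes');
    res.pause();                                        // stop reading for a few seconds...
    setTimeout(function () { res.resume(); }, 3000);    // ...then pick the stream back up
  });
  res.on('end', function () { console.log('done'); });
});
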
So it looks like I can’t get rid of the reverse-proxy that sits on top of some of our Ruby code. Oh well.

More on (Moron?) Ted Dziuba

(Math and pedantry ahead. Feel free to skip if that’s not your thing. More stuff about Node.js and this guy Ted Dziuba, who I hate.) To start, read Mr. Dziuba’s latest blog rant about Node.js.

I will make two main points here. First, that Mr. Dziuba does not intend to be comprehended – that he is deliberately phrasing points so as to confuse. Second, I will prove that his arguments are invalid. It appears to me that he has no interest in deriving any kind of truth or shedding any light on anything; he is merely trying to draw attention to himself, or draw blog traffic, or just make noise. I don’t know what his reasons are. I just know what he’s doing, and what he’s doing seems to be deliberately misleading. I would not feed the troll, as it were, except for the fact that his blog post is permanently on the internet, and I’m going to end up being pointed to it someday if I ever propose an event-loop-based solution for anything.

First off, let’s try to decipher what he’s saying; he does not do a very good job of being clear.

Let’s look at Theorem 1.

What does it actually say? Here’s an attempt at deciphering it.

He asks, let’s check out how things work when you have something that’s heavily CPU-biased (less I/O).

Note – the value ‘k’ is the ratio I/C. In other words, for something really, really IO-intensive (I > C), k is ‘big’ (greater than one), and for I < C, k is small (less than one). You can think of ‘k’ as the “IO-ish-ness” factor: it’s big for something that’s very IO-ish, and little if it’s not. Why doesn’t he explicitly state the definition that k = I/C? Because he has no desire to be understood; he’s attempting hand-waving. Everywhere he uses this ‘k’ construct he could just as easily have used I and C.

The important definition is: W = I + C = kC + C = (k+1)C

That is, since I = kC, the wall-clock time W of I + C works out to (k+1)C.

His theorem begins with the supposition:

1000/C > 1000N/((k+1)C)

What does that mean?

Let’s change his equation to make more sense of it. Since (k+1)C=(I+C) by definition, he’s really just saying:

1000/C > 1000N/(C+I)

He’s trying to suppose that *IF* the number of times I can execute just the CPU part of my event-loop code is greater than the number of times I can run it threaded, with the I/O time taken into account, *THEN* it must be the case that the number of threads I am using is one. Why would you make the argument like that? The same argument can be made, much more simply, by saying:

1000/(C+I) > 1000N/(C+I), which holds only if N is one. But the problem is that then you can see what he’s doing. He doesn’t want that, hence the pointless variable substitution.

Notice that ‘N’ factor in there? He is saying that a system with 2 threads runs twice as fast as a system with one thread. And, apparently, a system with 100 threads runs 100 times as fast as a system with one thread. I’ve worked with a lot of software throughout my personal and professional life, and this assumption is not true. Under this assumption, of course threads will always outperform event-loop software.

He attempts the same song-and-dance in Theorem 2. He is still making the assumption that N threads give you N times single-threaded performance.

It’s most clear in his “Practical Example.” There you can see him making the n-threads-means-n-times-performance argument most plainly. If that were true, why not 1000 threads? Why not a million?

Another point here: why threads? Why not fully fork()’ed processes? His math (such as it is) holds up just the same if you assume forked contexts instead of threaded ones. And yet none of his math requires threads instead of forks.

Effectively, he has proven that in a system with an infinite number of infinitely fast CPUs, infinite RAM, zero thread context-switch time, and zero thread-accounting time, threads are faster than events. Congratulations.

So those are my arguments as to why he is incorrect. Now I wish to ask some questions about how he seems to be deliberately misleading.

First off – some stylistic questions. Why has he written his argument so obtusely? Why has he not shown his work? Why does he not explain what he’s supposing? He just throws symbols down, in beautiful little .PNG files, and runs off manipulating them with algebra. That seems like an “appeal to authority” via jargon. And why all the milliseconds everywhere? We’re in theoretical comp-sci world now; why pick those units? It would appear that he has done so specifically to throw 1000’s into his equations everywhere, just to confuse things further.

Next – a more theoretical question. Why do things like the select() or poll() system calls exist? Or epoll or /dev/poll? Since they’re so “obviously” inferior to threading-based solutions, they shouldn’t exist at all, right? There should be no use for them. If I can always just use threaded I/O instead of event-looped, why use event-looped at all? It is, after all, very difficult to program.

And finally – why did Dziuba himself advocate for an event-based I/O solution – “eventlet” – in one of his own blog posts? He seems to have gotten quite the performance boost –

…but the one that really stands out in the group is Eventlet. Why is that? Two reasons:

1. You don’t need to get balls deep in theory to be productive with Eventlet.
2. You need to modify very little pre-existing code to adapt a program to be event-driven.

This all sounds great in theory, but I have actually made a large I/O bound program work using monkey patching and changing the driver. It is a piece of software that reads jobs from a queue and processes them, putting the result in memcached. For esoteric reasons I will not go into, the job processors could not thread the work, they had to fork. Using this setup, one production box with 8GB of RAM was consistently 7.5GB full. After a less than 5 line code change to the driver, that same production box uses only around 1GB of RAM consistently, and can handle 5 to 10x the throughput of the old system.

The answers to these questions I cannot be sure of. As much as I would like to imagine that Mr. Dziuba is simply terribly ignorant, it would seem to be far worse – he just intends to say things that are untrue for the purpose of drawing attention to himself.

Node.js is not a cancer, you are just a moron

My tone is going to seem strangely even and un-ranty. This is because I am doing everything I can to keep myself from completely exploding as I read the bullshit that this moron is spewing. OK, that was a little ranty, but the rest will read evenly. Maybe.

So one of my programming friends posts an article at http://teddziuba.com/2011/10/node-js-is-cancer.html and says, “Ah, here’s what’s wrong with Node.js!”

The article is rather strongly written – “Node.js is Cancer”, “node.js nonsense”, “Node.js is a tumor on the programming community”, “completely braindead”, “Scalability disaster”, etc.

He then shows a recursive Fibonacci function and how badly it performs under Node.

The problem he has proposed is, fundamentally, CPU-bound. I wrote a version of it in C, and it did perform faster than it did in Node – but either way, the problem took definite, finite time. My command-line Node.js version calculated the answer in 8 seconds; the C version did it in 4. I was rather impressed that Javascript (Node.js’s V8 engine) came that close to C’s performance in pure CPU-bound execution.
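For reference, the Node version is essentially the classic naive recursion – something like:

// deliberately CPU-bound: the naive recursive fibonacci
function fib(n) {
  return n < 2 ? 1 : fib(n - 1) + fib(n - 2);
}

console.log(fib(40));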

The problem, and what the author perhaps misunderstands, is that this is not the situation in which Node is an ideal solution. I use Node.js in production for work – and I know of many other shops that do too. If the problems you are dealing with are CPU-related, Node.js will not help you. Node.js works well when your problems are I/O-related – e.g., reading something out of a database, running web servers, reading files, writing files, writing to queues, reading from queues, reading from other web services, aggregating several web services together, etc. The reason this solution has become so popular of late is that these are the types of problems that are most common in web development today. Thus, Node.js becomes a helpful arrow in one’s quiver with which to solve these types of issues.
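That’s the sweet spot: kicking off several slow I/O operations at once and assembling the results as they arrive. A minimal sketch of that aggregation pattern (the hostnames are made up):

var http = require('http');

var results = {};
var pending = 2;

function fetch(name, options) {
  http.get(options, function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
      results[name] = body;
      if (--pending === 0) {
        console.log(results);   // both back-end services have answered
      }
    });
  });
}

// both requests are in flight simultaneously -- no threads required
fetch('users', { host: 'users.example.com', path: '/list' });
fetch('orders', { host: 'orders.example.com', path: '/list' });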

Considering that the article’s author seems to have some level of experience, I wonder if his choice of skewed example was deliberate. He has other articles on his blog about other event-loop libraries. His comment at the bottom – “tl;dr – Node.js is an unpleasant software library and I will not use it” – is possibly the real source of his anger. And that’s an irrefutable point – if you don’t like something, you won’t want to use it, and he obviously doesn’t. That’s fine.

Node is a tool; one of many – no panacea. If you’re dealing with ‘slow’ services that need to wait for various bits of I/O to complete in order to return a result – it can be a very powerful and useful tool. If you’re computing the fortieth member of the Fibonacci sequence recursively, it won’t be.

The sad fact is that the author’s completely valid point – that Node.js isn’t a good tool for CPU-bound problems – is completely buried in his bile, because he never states it explicitly. Node.js has other drawbacks as well – it’s very easy to end up in callback-spaghetti, it’s very minimal, and it’s very, very young. The database-integration libraries have some pretty serious immaturity issues to work through, and I’ve had to code around a good deal of that.

It’s a tool that’s good at particular things, and I will continue to use it for those things. Those ‘things’ tend to be the bulk of what web development and web-services development actually are. So when I can write a two-hundred-line program that replaces entire arrays of servers and interconnected services with just one server, I am going to do that – and I won’t feel particularly braindead in doing so.

node.js scope weirdity

So I’ve been using Node.js a lot at my new job. Quick note: it’s super awesome. The job, and Node.js. Anyways. I’ve put together a couple of non-trivial pieces with it, and one thing that keeps tripping me up is: when is my variable in scope, and when is it out? So I thought I’d write this up to see if I can explain it.

First example

Let’s look at this simple server code – it’s just a dumb webserver (shamelessly stolen from the node.js home page) that says ‘hello world’ and spits out a connection count:

var http = require('http');
var conncount=0;
http.createServer(function (req, res) {
  conncount++;
  num=conncount;
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.write("Here is some stuff\n");
  res.write("And the connection count is: "+conncount+"\n");
  setTimeout(function() {
    res.end('Hello World: conn count was: '+conncount+' and my connection # is: '+num+'\n');
    conncount--;
  },5000);
}).listen(1337, "127.0.0.1");

console.log('Server running at http://127.0.0.1:1337/');

So if I just curl that (curl http://localhost:1337/), I get:

Here is some stuff
And the connection count is: 1

…5 seconds pass, and then…

Hello World: conn count was: 1 and my connection # is: 1

So that seems to make some sense. However, what happens if I fire off 12 requests within the 5-second window? This:


Here is some stuff
And the connection count is: 1
Here is some stuff
And the connection count is: 2
Here is some stuff
And the connection count is: 3
Here is some stuff
And the connection count is: 4
Here is some stuff
And the connection count is: 5
Here is some stuff
And the connection count is: 6
Here is some stuff
And the connection count is: 7
Here is some stuff
And the connection count is: 8
Here is some stuff
And the connection count is: 9
Here is some stuff
And the connection count is: 10
Here is some stuff
And the connection count is: 11
Here is some stuff
And the connection count is: 12


…then 5 seconds elapse, then…

Hello World: conn count was: 12 and my connection # is: 12
Hello World: conn count was: 11 and my connection # is: 12
Hello World: conn count was: 10 and my connection # is: 12
Hello World: conn count was: 9 and my connection # is: 12
Hello World: conn count was: 8 and my connection # is: 12
Hello World: conn count was: 7 and my connection # is: 12
Hello World: conn count was: 6 and my connection # is: 12
Hello World: conn count was: 5 and my connection # is: 12
Hello World: conn count was: 4 and my connection # is: 12
Hello World: conn count was: 3 and my connection # is: 12
Hello World: conn count was: 2 and my connection # is: 12
Hello World: conn count was: 1 and my connection # is: 12

So my question is: why does it do that? Each execution of my function _should_ have its own stack, no? And so wouldn’t each stack have its own variables?

Now, mind you – I know a (horrible) way to fix this – wrap my setTimeout call in an anonymous function and pass ‘num’ as a parameter – but what I don’t really get is: why? I threw this line all the way at the end (with apologies to Haddaway):

setTimeout(function() { sys.debug("What is num! Baby don't hurt me, don't hurt me, no more..."+num)},10000);

(And I had to require('sys') at the top too)

And, in my terminal with Node running, I got:


DEBUG: What is num! Baby don't hurt me, don't hurt me, no more...12

What?! I would’ve expected ‘num’ to fall out of scope! Why wouldn’t that function scope up there make ‘num’ exist only for this execution? Is there no concept of a ‘stack’ or anything? And even if there weren’t, each execution of my function is its own execution, and should ‘freeze’ the variable or something, right? Apparently not.

So what happened? Well, I can tell you – that variable ‘num’ that I referenced, since I *didn’t* define it using ‘var’, is GLOBAL. So that’s why it’s acting so global. Simply adding ‘var’ to the definition (var num=conncount;) made it start working properly. E.g., after the delay, my output became:

Hello World: conn count was: 12 and my connection # is: 1
Hello World: conn count was: 11 and my connection # is: 2
Hello World: conn count was: 10 and my connection # is: 3
Hello World: conn count was: 9 and my connection # is: 4
Hello World: conn count was: 8 and my connection # is: 5
Hello World: conn count was: 7 and my connection # is: 6
Hello World: conn count was: 6 and my connection # is: 7
Hello World: conn count was: 5 and my connection # is: 8
Hello World: conn count was: 4 and my connection # is: 9
Hello World: conn count was: 3 and my connection # is: 10
Hello World: conn count was: 2 and my connection # is: 11
Hello World: conn count was: 1 and my connection # is: 12

Such a terribly easy way to blow up your javascript! Fortunately, node.js supports “strict mode” – just make the first line of your javascript code say:

"use strict";

(Note, that’s just a string, with the quotes. A javascript parser will just ignore it if it doesn’t understand it. You could also put a line in the middle of your code saying "poop"; and it would be ignored the same way.)

Now, with strict mode enabled, the previous version of my code (without the ‘var’ declaration) says:


ReferenceError: num is not defined

So I think I’ll be using this from now on. Unless ‘strict’ mode starts making me crazy – which is certainly also possible.

Next Example

"use strict";
var sys=require('sys');
for(var i=0;i<10;i++) {
 setTimeout(function() {sys.debug("I is now: "+i)},1000);
}

(Notice how I've learned my lesson? Yeah, I don't need concurrency bugs biting me in the ass, thankyouverymuch.)

The output is, unfortunately:

DEBUG: I is now: 10
DEBUG: I is now: 10
DEBUG: I is now: 10
DEBUG: I is now: 10
DEBUG: I is now: 10
DEBUG: I is now: 10
DEBUG: I is now: 10
DEBUG: I is now: 10
DEBUG: I is now: 10
DEBUG: I is now: 10

So this one - I definitely know how to fix. The problem is that by the time the timeout actually _fires_, the value of 'i' will be different - in this case, incremented all the way to 10. I need to somehow 'freeze' the value of i within the timeout.

So I would do:

"use strict";
var sys=require('sys');
for(var i=0;i<10;i++) {
 setTimeout((function(number) {return function() {sys.debug("I is now: "+number)}})(i),1000);
}

Which results in:

DEBUG: I is now: 0
DEBUG: I is now: 1
DEBUG: I is now: 2
DEBUG: I is now: 3
DEBUG: I is now: 4
DEBUG: I is now: 5
DEBUG: I is now: 6
DEBUG: I is now: 7
DEBUG: I is now: 8
DEBUG: I is now: 9

The problem is, that's ugly as shit. What better way is there to do it - something more readable, maintainable, debuggable, etc.? And that function-returning-a-function instantiation gives me the willies. Well, I don't know the best answer for that yet. How about this:

"use strict";
var sys=require('sys');
for(var i=0;i<10;i++) {
 (function(number) {
  setTimeout(function() {sys.debug("I is now: "+number)},1000);
 })(i);
}

(The output is still the same.) That feels a little less awful and unreadable - and doesn't give me the anonymous-function-returning-a-function yucky feelings that the previous one did. (Though it still is effectively doing that, isn't it?) The crazy squiggly-brace, close-paren, open-paren business is still a little awkward, though.

A piece of advice I got from the node.js group seemed pretty sage, in terms of making this stuff more readable:

"use strict";
var sys=require('sys');

function make_timeout_num(number)
{
 setTimeout(function() {sys.debug("I is now: "+number)},1000);
}

for(var i=0;i<10;i++) {
 make_timeout_num(i);
}

(The output is still the same again.) And, wow, yeah, that's a hell of a lot more readable, at the expense of four or so more lines. But sometimes, logically, you don't want to split out your functions like that - if every time you need to freeze something in a scope you have to declare a function somewhere else, your eyes will have to scan all over the place, and that could be ugly. So you could maybe declare the function within the for loop, as in the sketch below - though it's still in the global scope, it would just be for readability's sake.
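i.e., something like this (the output is the same once more):

"use strict";
var sys=require('sys');

for(var i=0;i<10;i++) {
 var make_timeout_num=function (number) {
  setTimeout(function() {sys.debug("I is now: "+number)},1000);
 };
 make_timeout_num(i);
}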

I think I'll probably stick with the earlier version, with the anonymous function declared and invoked in-line. It's not too insanely unreadable, and it's compact enough. If the contents of my anonymous function get a few lines long, or a few variables deep, I might split it out into its own function for readability's sake.