The Daily WTF
CodeSOD: One Last ID
Chris's company has an unusual deployment. They had a MySQL database hosted on Cloud Provider A. They hired a web development company, which wanted to host their website on Cloud Provider B. Someone said, "Yeah, this makes sense," and wrote the web dev company a sizable check. They app was built, tested, and released, and everyone was happy.
Everyone was happy until the first bills came in. They expected the data load for the entire month to be in the gigabytes range, based on their userbase and expected workloads. But for some reason, the data transfer was many terabytes, blowing up their operational budget for the year in a single month.
Chris fired up a traffic monitor and saw that, yes, huge piles of data were getting shipped around with every request. Well, not every request. Every insert operation ended up retrieving a huge pile of data. A little more research was able to find the culprit:
SELECT last_insert_id() FROM some_table_nameThe last_insert_id function is a useful one- it returns the last autogenerated ID in your transaction. So you can INSERT, and then check what ID was assigned to the inserted record. Great. But the way it's meant to be used is like so: SELECT last_insert_id(). Note the lack of a FROM clause.
By adding the FROM, what the developers were actually saying were "grab all rows from this table, and select the last_insert_id once for each one of them". The value of last_insert_id() just got repeated once for each row, and there were a lot of rows. Many millions. So every time a user inserted a row into most tables, the database sent back a single number, repeated millions and millions of times. Each INSERT operation caused a 30MB reply. And when you have high enough traffic, that adds up quickly.
On a technical level, it was an easy fix. On a practical one, it took six weeks to coordinate with the web dev company and their hosting setup to make the change, test the change, and deploy the change. Two of those weeks were simply spent convincing the company that yes, this was in fact happening, and yes, it was in fact their fault.
[Advertisement] Keep the plebs out of prod. Restrict NuGet feed privileges with ProGet. Learn more.CodeSOD: Identify a Nap
Guy picked up a bug ticket. There was a Hiesenbug; sometimes, saving a new entry in the application resulted in a duplicate primary key error, which should never happen.
The error was in the message-bus implementation someone else at the company had inner-platformed together, and it didn't take long to understand why it failed.
/** * This generator is used to generate message ids. * This implementation merely returns the current timestamp as long. * * We are, thus, limited to insert 1000 new messages per second. * That throughput seems reasonable in regard with the overall * processing of a ticket. * * Might have to re-consider that if needed. * */ public class IdGenerator implements IdentifierGenerator { long previousId; @Override public synchronized Long generate (SessionImplementor session, Object parent) throws HibernateException { long newId = new Date().getTime(); if (newId == previousId) { try { Thread.sleep(1); } catch (InterruptedException ignore) {} newId = new Date().getTime(); } return newId; } }This generates IDs based off of the current timestamp. If too many requests come in and we start seeing repeating IDs, we sleep for a second and then try again.
This… this is just an autoincrementing counter with extra steps. Which most, but I suppose not all databases supply natively. It does save you the trouble of storing the current counter value outside of a running program, I guess, but at the cost of having your application take a break when it's under heavier than average load.
One thing you might note is absent here: generate doesn't update previousId. Which does, at least, mean we won't ever sleep for a second. But it also means we're not doing anything to avoid collisions here. But that, as it turns out, isn't really that much of a problem. Why?
Because this application doesn't just run on a single server. It's distributed across a handful of nodes, both for load balancing and resiliency. Which means even if the code properly updated previousId, this still wouldn't prevent collisions across multiple nodes, unless they suddenly start syncing previousId amongst each other.
I guess the fix might be to combine a timestamp with something unique to each machine, like… I don't know… hmmm… maybe the MAC address on one of their network interfaces? Oh! Or maybe you could use a sufficiently large random number, like really large. 128-bits or something. Or, if you're getting really fancy, combine the timestamp with some randomness. I dunno, something like that really sounds like it could get you to some kind of universally unique value.
Then again, since the throughput is well under 1,000 messages per second, you could probably also just let your database handle it, and maybe not generate the IDs in code.
[Advertisement] Keep the plebs out of prod. Restrict NuGet feed privileges with ProGet. Learn more.