露の世は露の世ながらさりながら。。。: Scaling is hard

Hello readers! I apologize for the lack of posts for the past... (checks calendar) six months... *cough* I will resume my series of posts on change tracking in the near future! In the meantime...

Enterprise scaling is hard. Anyone who says differently is selling something. Due to the incredible public visibility (to say nothing of the... investment... some of the fans bring), scaling in video games adds a miles-thick layer of difficulty. Recently, Bungie has had a few significant bugs crop up in their flagship title, Destiny 2. These bugs affected player inventories, and led to unplanned downtime while fixes were implemented. Twice. (Insert your choice of 'oof' meme here.) Bungie has published a good post-mortem that, while understandably light on the technical details, gives some insight into what they were doing and how things broke.

This incident called to mind something I worked through. It's absolutely not the same level of issue, but it's illustrative of the interesting things you see at scale. A previous employer's business requirements meant we had servers close to just about every major city on the planet, and they were kept fairly busy. Due to the nature of the business even a server that wasn't busy had to be very responsive, and a busy server that got slow was a nightmare for customers.

One server began periodically getting slow. REALLY slow. A proc that averaged a worker time of under a millisecond (props to the developers there: the tables were well-normalized, the queries well-written, and they understood indexing so the indexes were sensible as well) started taking over a second to respond. Our initial thought was that it had generated a bad plan by dumb luck (due to the nature of the data and how it changed over time, plans got regenerated a few times a week following stats maintenance, so far so normal), so we started by comparing the plan on this server to the plans from a sampling of its compatriots. The plan was indeed different (and bad!) on this one, so we gave the business a heads-up that we were going recompile that proc, we'd probably see a long execution while the new plan was generated, and so on. Recompiling worked a treat, the new plan was good, and matched the good plans on the other servers, so we were happy, the business was happy, everybody was happy.

A few weeks later, this one server started acting up again. Meet the new bad plan, same as the old bad plan. But only on this one server. We looked at a different sample of other servers of the same type - all consistent among themselves, and they all had generated the same good plan. So, again with the recompile, which made everything happy again. Once again the new plan was good, matching the good plan everywhere else.

We didn't wait for the third time to assume that it wasn't just dumb luck causing the bad plans. Parameter sniffing is fun, after all. One of the fun parts here was that even though the plan WAS using the desired index, it was just... wrong. It's worth noting that all the servers of this type had the same source tables replicated down to them. No local writes, just reads, and the data sets were identical.

So why was this lone box coming up with this suboptimal plan even using the correct index? Well, one common thread was that the bad plans were generated from one parameter value. The application these servers supported did a nightly test, and each server had a short list of test values that they'd cycle through. The tests were server-specific so no two servers would use the same test records. The server tests were done during slow hours for that region, and just so happened to happen after index maintenance was done. Every so often, the old plan would be cleared after maintenance and we'd get the bad plan when the right test parameter was passed in.

I started digging into the index. DBCC SHOW_STATISTICS is tremendously useful for understanding what a given index looks like under the hood. The histogram in particular was helpful, as it allowed me to tell where in the index the bad parameter was found. I pulled copies of the various objects related to the issue down into my sandbox and set to work on it.

After ensuring I could reproduce the problem at will, I set to checking the parameters used to generate the good plans. Some of the values corresponded to test records, some corresponded to customer data, but interestingly they all were found in different steps in the index. I noticed that the steps weren't all of identical size - the step with the bad parameter had a larger range than the other steps. Out of curiosity, I tried passing some other parameters from the same step as the parameter that produced the bad plan, and those parameters also caused the same bad plan to be generated. Similarly, other values from the steps that contained test values that generated good plans would generate good plans.

So far, so good. I did some inquiring and found that the bad parameter value was the only test value in that index step, but there was always the chance that a plan could be generated using some other parameter from that step if the plan fell out of the cache for whatever reason.

My next step was to see if changing our indexing strategy would help. The table had grown a lot over time since the application was rolled out, and maybe we'd hit the point when rethinking things was the right thing to do. A discussion of the nuances of index design is far outside the scope of this post, but as with most things the online documentation covers it pretty well, and of course many smart people have spilled a lot of virtual ink on the topic; your search engine of choice is your friend here.
I tried a number of things both on my own and in consultation with folks who know a lot more about indexes than I do, but nothing helped. While I was able to get the optimizer to consistently generate a plan that was slightly better than the bad plan regardless of the parameter passed, it was still substantially worse than the good plan. The solution wasn't to be found in the indexes, to be sure.

In the end, we wound up creating a job that checked the stored procedure's performance and would force a recompile if the maximum worker time was over a certain duration. We scheduled the job to run every few minutes, and this took care of the problem sufficiently well. We shortly deployed this to all the relevant SQL instances, and it turns out that it periodically helped prevent the problem on other servers as well over time. I had an opportunity to talk with some folks from Microsoft later, and they agreed that under the circumstances the recompile job was the best solution.

Scale gets strange at times. Things that you never encountered before start happening more and more frequently, and sometimes you have to approach the problem differently than you think.

露の世は露の世ながらさりながら。。。

Sunday, February 16, 2020

Scaling is hard

No comments:

Blog Archive

Links