How do you find a performance problem in a working system at a customer site? The ones used to running those sophisticated profiler tools would try to sneak one of them into the customer environment, modify the runtime settings, get a snapshot, another one... All this with a customer looking over their shoulder. A virtual one (fortunately there's WebEx) but still.
This week I had two support cases that proved that there is a much simpler way. The key is to take a thread dump periodically and see whether there's any pattern.
One of our customers was complaining that after the system runs for several days, its performance degrades. "What is slow?" - "Everything!". A classical hopeless case. The moment you hear it you know the escalation is going to be open for weeks with fruitless attempts to understand what specific scenario is slow and with a disappointed customer in the end. But before you give up, take thread dumps! The more the merrier.
After examining them for a long time without any clue I suddenly noticed something strange. A thread that was doing some database-related activity was in the LinkedList.remove() method called from inside the JDBC driver. This would not be so curious if it didn't repeat itself in almost every thread dump. Every time a thread was doing something JDBC-related, the thread dump caught it in that 'remove' method.
I admit that I forgot for a moment how 'remove' really works and was under the impression that a remove from a linked list is a constant performance operation. It is because of this temporary "blackout" that I was so stunned. What are the chances that every time I take a thread dump I would "catch" a thread in a method that should take nanoseconds?! Something completely bizarre was going on here.
Surely, after having finally read the source of LinkedList I realized that remove was not done using an iterator and hence had to find the object to be removed first, traversing the whole list. From here the conclusion was almost obvious - there is a leak in that list. Some digging in the code of the driver revealed that this list was used only when a certain connection parameter was set (that's why it didn't happen always), but these are the "gory details".
The lesson here is that a pure statistical approach to finding performance hot-spots is not bad at all. Imagine you are running your program in the dark and randomly flash a light that shows you where it is now. If your hot-spot is sufficiently "hot", most of the times you should find your program inside it (by the way, if I'm not mistaken, this is how profilers' sampling feature works).
Needless to say, the second support case I mentioned was very similar. Here it was even simpler because this customer was load-testing our system, so one thread dump was sufficient. Almost every "interesting" thread was at the same place. Case solved.
Lesson learnt - when learning "advanced tools", do not forget the basics. "Advanced tools" only save time, it doesn't mean one is helpless without them.
Excellent post, thanks for sharing!
ReplyDelete