Friday, August 20, 2010

The simplest hot-spot detection tool

How do you find a performance problem in a working system at a customer site? Those used to sophisticated profiler tools would try to sneak one into the customer environment, modify the runtime settings, take a snapshot, then another one... All this with the customer looking over their shoulder. A virtual shoulder (fortunately there's WebEx), but still.

This week I had two support cases that proved there is a much simpler way. The key is to take a thread dump periodically and see whether a pattern emerges.
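On a HotSpot JVM the usual way to get a thread dump is `jstack <pid>` or sending the process a `kill -3`; when you can run code inside the JVM, the same information is also available programmatically. A minimal sketch (the class name ThreadDumper is mine):

```java
import java.util.Map;

public class ThreadDumper {
    // Print a stack dump of every live thread: a crude in-process
    // equivalent of `jstack <pid>` or `kill -3 <pid>`.
    public static void dumpAllThreads() {
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            System.out.println("\"" + t.getName() + "\" state=" + t.getState());
            for (StackTraceElement frame : e.getValue())
                System.out.println("    at " + frame);
            System.out.println();
        }
    }

    public static void main(String[] args) {
        dumpAllThreads();
    }
}
```

Run it a few times, a minute apart, and simply eyeball the output for frames that keep recurring.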

One of our customers was complaining that after the system runs for several days, its performance degrades. "What is slow?" - "Everything!". A classic hopeless case. The moment you hear it you know the escalation will stay open for weeks, with fruitless attempts to pin down which specific scenario is slow, and a disappointed customer in the end. But before you give up, take thread dumps! The more the merrier.

After examining them for a long time without a clue, I suddenly noticed something strange. A thread doing some database-related activity was inside the LinkedList.remove() method, called from within the JDBC driver. This would not be so curious if it didn't repeat itself in almost every thread dump. Every time a thread was doing something JDBC-related, the dump caught it in that 'remove' method.

I admit that for a moment I forgot how 'remove' really works and was under the impression that removing from a linked list is a constant-time operation. It is because of this temporary "blackout" that I was so stunned. What are the chances that every time I take a thread dump I "catch" a thread in a method that should take nanoseconds?! Something completely bizarre was going on.

Sure enough, after finally reading the source of LinkedList I realized that the removal was not done through an iterator, so it first had to find the object to be removed, traversing the whole list. From here the conclusion was almost obvious: there was a leak in that list. Some digging in the driver's code revealed that the list was used only when a certain connection parameter was set (which is why it didn't always happen), but those are the "gory details".
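To see why a thread can be caught in 'remove' so often: LinkedList.remove(Object) holds no reference to the node, so it must walk from the head, calling equals() on each element until it finds a match. A small sketch that makes the linear scan visible by counting the comparisons (CountingKey is a made-up class for illustration):

```java
import java.util.LinkedList;

public class RemoveCost {
    // Illustrative key type that counts how many times equals() is invoked.
    static class CountingKey {
        static int equalsCalls = 0;
        final int id;
        CountingKey(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            equalsCalls++;
            return (o instanceof CountingKey) && ((CountingKey) o).id == id;
        }
        @Override public int hashCode() { return id; }
    }

    public static void main(String[] args) {
        LinkedList<CountingKey> list = new LinkedList<>();
        for (int i = 0; i < 10_000; i++) list.add(new CountingKey(i));

        // remove(Object) has no node handle: it scans from the head,
        // calling equals() on every element until it finds the match.
        list.remove(new CountingKey(9_999));

        System.out.println(CountingKey.equalsCalls);  // prints 10000
    }
}
```

With a leak the list only grows, so every JDBC operation pays an ever-increasing scan - which is exactly why the dumps kept landing there.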

The lesson here is that a purely statistical approach to finding performance hot-spots is not bad at all. Imagine you are running your program in the dark and randomly flash a light that shows you where it is now. If your hot-spot is sufficiently "hot", most of the time you should find your program inside it (by the way, if I'm not mistaken, this is how profilers' sampling mode works).
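That "randomly flash a light" idea is essentially what a sampling profiler does: grab stack traces at intervals and build a histogram of the frames you land in. A toy version (the names PoorMansSampler and hotLoop are mine):

```java
import java.util.HashMap;
import java.util.Map;

public class PoorMansSampler {
    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(PoorMansSampler::hotLoop, "worker");
        worker.setDaemon(true);
        worker.start();

        // Take 200 samples of the worker's stack, 5 ms apart.
        Map<String, Integer> histogram = new HashMap<>();
        for (int i = 0; i < 200; i++) {
            StackTraceElement[] stack = worker.getStackTrace();
            if (stack.length > 0) {
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                histogram.merge(top, 1, Integer::sum);
            }
            Thread.sleep(5);
        }

        // A sufficiently "hot" frame should dominate the histogram.
        String hottest = histogram.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse("no samples");
        System.out.println("hottest: " + hottest);
    }

    static void hotLoop() {            // deliberately busy method
        long x = 0;
        while (x >= 0) x += System.nanoTime() % 7;
    }
}
```

Real sampling profilers do the same thing with lower overhead; periodic thread dumps are just the manual, zero-install version.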

Needless to say, the second support case I mentioned was very similar. It was even simpler because that customer was load-testing our system, so a single thread dump sufficed. Almost every "interesting" thread was in the same place. Case solved.

Lesson learnt - when learning "advanced tools", do not forget the basics. "Advanced tools" only save time; being without them doesn't make you helpless.

Friday, August 13, 2010

On Infrastructure Teams


Prologue

Top-level manager: Why did it take so long for your developers to implement this feature?!
Second-level manager 1: They had to invest in the infrastructure because it wasn't flexible enough to support this feature.
Top-level manager: Hmm.. And you? Why do you have a delay?
Second-level manager 2: We discovered that our component foo had a bug - my engineers spent a day brainstorming how to tackle it.
Top-level manager: This foo component, doesn't it do just what your bar component is doing?
Second-level manager 1: Well, almost, but we don't have a common codebase.
Top-level manager: I think I have a great idea! We should build an infrastructure team! We will put all smart folks there and they will write these components once and for all! We shall save a lot of resources by doing these components only once!

[A year later]

Developer: What do you think about adding this functionality to our system?
Team lead: Excellent idea! But wait!.. For this the infra component we are using must be changed. I'll see if I can ask them to add this nice feature.
Infra team lead: Sounds cool! Do you have a written spec? No? You'd better write one, so we can understand better what you need. We would also be able to show it to the other teams so they could have their say. That's a great suggestion, and I even think we should make something much more general out of it! I shall check our work plans - it looks like in June we will be able to start working on this. Great, keep coming with ideas like this! And don't forget to bring the spec!

[Another year has passed]


Top-level manager: Amazing, I didn't expect you to complete this so fast!
Second-level manager: It's because we decided to stop using the infra component and our engineers made something much lighter and faster in just a week! What they wrote is exactly what we need, without the overhead that component incurred.

[Couple of months later]


Customer support manager: Wait! I remember we solved this problem a year ago - how could it have happened again?!
Team lead: Well, since then we stopped using that infra component and have implemented our own...

* * *

Sound familiar? I believe every organization goes through cycles like this. We software developers hate copy-and-paste. We love reuse. We cannot stand having two similar pieces of code without trying to refactor them into a common component. And what we especially love to write are infrastructural components!

This is how infrastructure teams are born. It seems so natural! It's cool to work in such teams! But as the team's area of influence expands, it becomes more rigid. If it serves only one project, the others get angry at it. If it tries to satisfy everyone, it finds it difficult to change anything. No single project has any sure way of influencing its work plans. It becomes unresponsive. Then the projects find (or invent) reasons to stop using the team's services. And the cycle starts again.

It's interesting to note that this happens at all levels. Shall we generalize this class to support this case as well? Wouldn't it make it too cumbersome and hard to support? Shall we use that third-party library? Looks like it does what we need, but it does so much more! Wouldn't it incur an overhead? Shouldn't we create a group that is responsible for various security-related components? Should we use the services of IT or support our own servers? Etc. etc.

To use or to develop? It's an eternal dilemma in software. As always, the art is to combine the two approaches. Choose rigorously what to adopt without succumbing to the NIH syndrome. When I face a choice between adopting a third-party component and developing one internally, I ask myself: what was the primary use case the off-the-shelf component's developers had in mind? Does it match ours? Even if what you need is served well by the component, but it is only one of a dozen areas it covers, you probably shouldn't take it. Size generally brings a steep learning curve and demands serious expertise, and that price may not be justified. The size and complexity of the component should match the complexity of the problem you are trying to solve.

Another indicative question is whether the problem you are trying to solve lies on your primary path. For instance, you shouldn't develop your own build framework unless that is your primary product; take an existing one and adapt yourself to its principles. However, never compromise when it comes to your primary expertise.

So, do we need infrastructure teams? It seems there's no good answer here either. We do. We don't. We create them. We hate them. Open source seems the best viable alternative - not necessarily internet-based open source; it may be an internal one. But with a possibility for the consumers to actively influence the code (provided they don't break existing tests) and for a central authority to supervise. Whenever there is a monopoly on changes to a component, it provokes antagonism among its users - especially when they are skilled enough to develop a better one themselves.