
Diving into Python Memory to track leaks and ballooning

Memory leaks in Python are rare thanks to the garbage collector. However, memory ballooning is still possible, and cyclic references can be hard to track down. Let's walk through a real example of how a few tools can help track and fix these memory issues.

Why this post?

As a pet project, I'm contributing to Archipel, a virtual machine orchestrator based on libvirt that uses XMPP as its underlying communication system (which is really cool!).

This software is based on xmpppy, an XMPP Python library that is no longer maintained or developed, but is pretty easy to use and flexible enough to do everything we wanted.

I noticed that the main daemons (archipel-agent and archipel-central-agent) were eating more and more memory over time and sometimes ended up taking all the available memory (32 GB for a Python daemon is pretty huge given what it actually does).

So I needed some tools to see what was really stored in memory and why it kept growing.

What's in my Toolbox

A small compilation of tools I found:

  • objgraph: A super useful module to visually explore Python object graphs
  • guppy: A heap analysis toolset
  • ipdb/ipython: A powerful interactive Python shell and debugger

Bring it on!

This issue is quite hard to reproduce as we have to wait long enough to see what is eating the memory.

So the first thing was to add a signal handler (so I can fire up an ipdb console whenever I want to) by adding the following code to the main program:

    import signal
    import resource

    try:
        import ipdb as pdb
    except ImportError:
        import pdb

    def handle_signal(signal_number, frame_stack):
        import objgraph   # handy to have around in the debugger session
        # print the peak resident set size (in kB), then drop into the debugger
        print 'Memory : %s (kb)' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        pdb.set_trace()

    signal.signal(signal.SIGINT, handle_signal)

This way, each time I hit Ctrl+C, it opens an ipdb console (and also shows the current state of memory).

From this prompt, I can run whatever Python code I need to track down and understand where the issue is hiding.

TIP: to be able to write multi-line functions, I can run the following inside my ipdb prompt:

!import code; code.interact(local=vars())

From there I can start to inspect the memory using guppy and create some useful graphs using objgraph.

Clean the dust

First of all, I should clean up everything I can using the garbage collector; only after that can I get meaningful information about the memory taken by my process.

>>> import gc
>>> gc.collect()
6551
>>> resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
64316

Now let's dig into the memory.

These are not the droids you are looking for...

With guppy I can print the current heap:

In : import guppy
In : hp=guppy.hpy()
In : h= hp.heap()
In : h
Partition of a set of 343006 objects. Total size = 59627592 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 115457  34 12238120  21  12238120  21 str
     1  36156  11 11455008  19  23693128  40 dict (no owner)
     2   7591   2  7955368  13  31648496  53 dict of xmpp.simplexml.Node
     3  55817  16  4811344   8  36459840  61 tuple
     4  23899   7  3405328   6  39865168  67 unicode
     5   2439   1  2556072   4  42421240  71 dict of xmpp.protocol.Iq
     6  23194   7  2505488   4  44926728  75 list
     7    588   0  1931808   3  46858536  79 dict of module
     8  14401   4  1728120   3  48586656  81 types.CodeType
     9  14104   4  1692480   3  50279136  84 function
<830 more rows. Type e.g. '_.more' to view.>

As you can see, there are a lot of strings and dicts... but hold on, what am I really looking for?

My daemons are quite simple: they manipulate XML objects, receive and send XMPP stanzas (which are also XML objects), and that's all, so I will dig into these items.

A stanza can be a Message or an Iq and is composed of Nodes; Nodes contain dicts, and dicts contain strings.
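To make that concrete, here is a rough sketch of how such a stanza is built (simplified, from memory of the xmpppy API, so treat the exact calls as an approximation):

    from xmpp.protocol import Iq

    # an Iq stanza is a tree of simplexml Nodes...
    iq = Iq(typ='result', to='admin@central-server.archipel.priv')
    query = iq.addChild(name='query', namespace='archipel:xmppserver')
    users = query.addChild(name='users', attrs={'xmpp': 'True', 'xmlrpc': 'False'})

    # ...and each Node keeps its attributes in a plain dict of strings
    print users.attrs   # {'xmpp': 'True', 'xmlrpc': 'False'}
    print str(iq)       # serialises the whole Node tree back to XML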

As we can see, Node and Iq objects take up a big part of the memory. Let's dig into them.

In : h[5].domisize
25025512

It seems we have 25 MB of Iq objects kept in memory, which is huge! (We do not keep any sort of history.)

In : h[5].byid
Set of 2439 <dict of xmpp.protocol.Iq> objects. Total size = 2556072 bytes.
 Index     Size   %   Cumulative  %   Owner Address
     0     1048   0.0      1048   0.0 0x19446d0
     1     1048   0.0      2096   0.1 0x1f9f790
     2     1048   0.0      3144   0.1 0x23e5750
     3     1048   0.0      4192   0.2 0x25517d0
     4     1048   0.0      5240   0.2 0x2551ad0
     5     1048   0.0      6288   0.2 0x2551b90
     6     1048   0.0      7336   0.3 0x2a75dd0
     7     1048   0.0      8384   0.3 0x2a75f10
     8     1048   0.0      9432   0.4 0x2a961d0
     9     1048   0.0     10480   0.4 0x2a96790
<2429 more rows. Type e.g. '_.more' to view.>

Roughly 2,500 objects of 1048 bytes each, interesting… Let's use objgraph now to get more information about their relationships.

With objgraph I can walk through object types like this:

In : import objgraph
In : objgraph.show_most_common_types()
dict                       59731
tuple                      56281
list                       24216
function                   14161
Node                       7857
instance                   5824
instancemethod             3791
weakref                    2704
Iq                         2512
builtin_function_or_method 1759

So here we can see that I have a lot of Iq objects, as we saw previously. Let's see what they are:

In: for iq in objgraph.by_type('Iq'): print iq
...

---snip---

<iq to="admin@central-server.archipel.priv/ArchipelController" from="agent-1.archipel.priv@central-server.archipel.priv/agent-1.archipel.priv" id="9394860" type="result"><query xmlns="archipel:xmppserver"><users xmpp="True" xmlrpc="False" /><groups xmpp="False" xmlrpc="True" /></query></iq>
<iq to="admin@central-server.archipel.priv/ArchipelController" from="agent-1.archipel.priv@central-server.archipel.priv/agent-1.archipel.priv" id="9396011" type="result"><query xmlns="archipel:xmppserver"><users xmpp="True" xmlrpc="False" /><groups xmpp="False" xmlrpc="True" /></query></iq>

---snip---

These are Iq responses… there are a lot of them, almost identical (except for the id). They should have been cleaned up by the GC, but that's not the case, so something must still hold references to them. Let's see who the holders are.

objgraph can generate pretty useful graphs showing these references:

In : objgraph.show_backrefs(objgraph.by_type('Iq')[0], filename="/vagrant/g.png",refcounts=True, max_depth=5)
Graph written to /tmp/objgraph-a08L96.dot (10 nodes)
Image generated as /vagrant/g.png

Here is the result:

Oh, a cyclic reference (in red)! Normally the garbage collector should take care of it, but it is strongly recommended to deal with such cycles at the source. There are plenty of ways to fix this; I chose to use a weakref.

The parent object is now a weakref.proxy, and the garbage collector can clean everything up easily, as you can see here.
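For illustration, here is a minimal sketch (not the actual xmpppy code) of breaking a parent/child cycle with weakref.proxy:

    import gc
    import weakref

    class ToyNode(object):
        def __init__(self, name, parent=None):
            self.name = name
            self.kids = []
            # keep only a weak proxy to the parent: the child no longer
            # keeps the parent alive, so there is no strong reference cycle
            self.parent = weakref.proxy(parent) if parent is not None else None

    root = ToyNode('iq')
    child = ToyNode('query', parent=root)
    root.kids.append(child)

    print child.parent.name   # the proxy behaves like the real parent: 'iq'
    del root, child           # no cycle left behind for the collector to break up
    gc.collect()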

Straight to the root

I thought fixing the cyclic reference would stop all the memory growth, but it did not. Something was still eating my memory! Using the same tools as before, I figured out that I still had tons of Iq objects hanging around in memory. Using objgraph again helped a lot to find the real source:

In : objgraph.show_backrefs(objgraph.by_type('Iq')[:3], filename="/vagrant/g.png",refcounts=True, max_depth=5)
Graph written to /tmp/objgraph-NZsvZL.dot (24 nodes)
Image generated as /vagrant/g.png

This is interesting… but wait, I also have a bunch of Messages hanging around (a Message is, like an Iq, another kind of stanza). Let's see what's behind them and whether it's related to the previous issue:

In : objgraph.show_backrefs(objgraph.by_type('Message')[:5],max_depth=8, filename="/vagrant/g.png")
Graph written to /tmp/objgraph-Dhqhdu.dot (58 nodes)
Image generated as /vagrant/g.png

What a mess!

As we can see, the Dispatcher instance keeps a reference to every stanza dispatched (and most of them are actually held through callbacks).

For the record, a closure contains cells which hold the variables captured from the enclosing scope.
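Here is a tiny self-contained illustration (hypothetical names) of how that bites us here: a callback registered for a response closes over the stanza, so storing the callback keeps the stanza alive.

    def make_callback(stanza):
        # 'stanza' is captured in a closure cell: as long as the callback
        # itself is referenced somewhere (a handler table, _expected, ...),
        # the stanza it closes over stays reachable too
        def on_response(conn, response):
            print 'got an answer to', stanza
        return on_response

    cb = make_callback('<iq id="9394860" ... />')
    print cb.func_closure   # (<cell at 0x...: str object at 0x...>,)  (Python 2)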

So, as we can read in this diagram, the xmpp.dispatcher.Dispatcher instance has a dict attribute named _expected, and this attribute contains every Iq/Message sent (keyed by stanza ID) that is still waiting for a response.

I took a look at the xmpppy library and found that in a specific case of Dispatcher.Dispatch, session._expected[ID] was never cleaned up. I added a del session._expected[ID] right after the response has been processed and, tada, no more memory leak!
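To make the nature of the fix clearer, here is a minimal, self-contained model of that kind of bookkeeping (hypothetical names, not the actual xmpppy code):

    # hypothetical model of the _expected bookkeeping described above
    _expected = {}

    def send_stanza(stanza_id, stanza):
        # remember the stanza until its response comes back
        _expected[stanza_id] = stanza

    def handle_response(stanza_id, response):
        original = _expected.get(stanza_id)
        # ... hand 'response' to whoever was waiting on 'original' ...
        # the one-line fix: forget the entry once it has been processed,
        # otherwise every stanza ever sent stays reachable forever
        if stanza_id in _expected:
            del _expected[stanza_id]

    send_stanza('9394860', object())
    handle_response('9394860', object())
    print len(_expected)   # 0: nothing left hanging around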

Are we done?

After stressing the new code a bit, I noticed that some objects were still in memory:

In : print 'Memory : %s (kb) with %s Node %s iq %s message %s dict' % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,objgraph.count('Node'),objgraph.count('Iq'),objgraph.count('Message'),objgraph.count('dict') )
Memory : 115188 (kb) with 377 Node 0 iq 0 message 11235 dict

We can see that our previous fix works since we don't have any IQ/Messages hanging.

But I still have Nodes. I know that we use the simplexml module to deal with XML (libvirt domain definitions, for example), so let's see what these Nodes are.

First, I filter out the top-level Nodes (those that have no parent, as opposed to leaf Nodes):

In : topNodes=[]
In : for node in objgraph.by_type('Node'):
...:     if not node.parent:
...:         topNodes.append(node)
In : len(topNodes)
70

So we have 70 Nodes left (out of 377) without a parent; let's sort them and count them by their .name attribute:

In : topNodesByName = sorted(topNodes, key=lambda node: node.name)
In : NodeCount = {}
In : for item in topNodesByName:
...:     NodeCount[str(item.name)] = NodeCount.get(str(item.name), 0) + 1
In : NodeCount
{'domain': 5, 'features': 6, 'success': 6, 'iq': 11, 'capabilities': 2, 'item': 34, 'groups': 2, 'stream:stream': 11, 'users': 2}

We see that Nodes named item represent almost 50% of the remaining Nodes; let's see how they are referenced.

Gather the item Nodes together:

In : items=[]
In : for item in topNodesByName:
...:    if item.name == 'item':
...:            items.append(item)

Let’s pick one and find its index:

In: objgraph.by_type('Node').index(items[0])
93
In: del topNodes
In: del topNodesByName
In: del items

Note: before generating the graph, I must delete every intermediate variable (topNodes…) as they would otherwise appear in the graph too...

Now generate the chain of back references:

In : objgraph.show_chain(
...:     objgraph.find_backref_chain(
...:         objgraph.by_type('Node')[93],
...:         objgraph.is_proper_module),
...:     filename='/vagrant/chain.png')

Graph written to /tmp/objgraph-cujzXy.dot (13 nodes)
Image generated as /vagrant/chain.png

We can see that the TNPubSubNode instance is holding these Nodes. By looking at the code, we figure out that these Nodes are the retrieved PubSub events. They are stored to avoid fetching everything each time (and the limit is set to 1000, so this will never grow endlessly).

To be sure that we tracked everything we wanted to, I check with guppy what remains in memory:

In : import guppy
In : hp = guppy.hpy()
In : h = hp.heap()
In : h
Partition of a set of 241435 objects. Total size = 34193088 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 119302  49 12690504  37  12690504  37 str
     1  53350  22  4633584  14  17324088  51 tuple
     2   4954   2  2573296   8  19897384  58 dict (no owner)
     3    588   0  1931808   6  21829192  64 dict of module
     4  14400   6  1728000   5  23557192  69 types.CodeType
     5  14109   6  1693080   5  25250272  74 function
     6   1318   1  1187584   3  26437856  77 type
     7   1318   1  1138192   3  27576048  81 dict of type
     8   2079   1   944200   3  28520248  83 unicode
     9   3902   2   717680   2  29237928  86 list
<832 more rows. Type e.g. '_.more' to view.>

Everything seems fine; we still have a lot of str and dict objects, but nothing suspicious.

Final word

It might sound like an easy fix, but these issues gave me a hard time...
