Thursday, July 25, 2013

Stupid shell trick (or: who's been up-chucking all over my /tmp?)

This isn't actually something going mental on my own machine, though it could very well have been, I suppose.  I'm doing a lot with bitbake at work these days.  One of the features I've been spending time with is the PR Service, which, in the absence of any other configuration changes, starts up a process that queries and feeds an sqlite3 database with hashes and revision numbers.  The intent is to get around problems that arise when you don't properly update the PR value in your recipe.

And if that's all gibberish to you, don't worry, that's just setting the stage.

Here's the thing.  If the server doesn't shut down cleanly -- for whatever reason -- it leaves a pidfile in /tmp.
Aside: I briefly considered providing a patch to bitbake that would move the pidfile out of /tmp, mostly on religious grounds: I'm offended by anything that leaves plain files in /tmp as a matter of course. I opted not to bother, since I didn't think most people would care, or would welcome the change if I did submit it. And really, so long as you're not having problems with your build, bitbake does clean up after itself and removes the pidfile. The problem is, of course, that the only reason I get dragged into this stuff is that builds don't go well. So, observer's bias, I guess.
So then I log in to a particularly ill machine and discover this:
% ls -l /tmp/PRServer_127.0.0.1_* | wc -l
599
... yeah.  We do not have 599 active PR Servers running on this machine.  But it's also, let's say, challenging to find anything I actually happen to care about in /tmp now, thanks to this heap of detritus.

Ah.  At last we're getting near the point.

The pidfile holds the PID of the server.  So for starters, let's collect up a list of potential PIDs.  I figure that's a good first step, then I can look to see if those PIDs are still active and, if they are, if they happen to be a PR Service (since given the length of time this has obviously been running amok, there's no reason to assume that an active PID is actually the same one that was given to the PR Server in question).

First step is easy.  Of course I typed this all on the command line, I don't actually create scripts ... well, much if ever, honestly.  I create shell functions more often than that by far, but I don't even bother enshrining most of these quick hacks in functions.  If they age out of my shell history, I recreate them from scratch.  There's no magic here, anyway.  But for the sake of readability, let's look at this as if it were a script or function:
for i in /tmp/PRServer_127.0.0.1_*
do
    cat "$i"
done
Why not just 'cat /tmp/PRServer_127.0.0.1_*' you ask?  Because that destroys any ability I have to operate on individual files.  Bear with me.
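To make that concrete, here is a minimal sketch of a loop that keeps each PID paired with the file it came from, which a single cat of the glob can't do. The function name list_pidfiles and its directory argument are made up for illustration; the file-name prefix is the one from the listing above.

```shell
# Hypothetical helper: print "<pidfile> <pid>" for every PR Server pidfile
# in a directory (defaults to /tmp). Names here are illustrative only.
list_pidfiles() {
    dir=${1:-/tmp}
    for f in "$dir"/PRServer_127.0.0.1_*
    do
        [ -f "$f" ] || continue        # the glob matched nothing
        # read each PID and keep it attached to its file name;
        # the "|| [ -n ... ]" copes with a missing trailing newline
        while read -r pid || [ -n "$pid" ]
        do
            [ -n "$pid" ] || continue  # skip blank lines
            printf '%s %s\n' "$f" "$pid"
        done < "$f"
    done
}
```

Run as list_pidfiles with no argument, it would walk /tmp; the pairing is what lets a later step remove exactly the file whose PID turned out to be dead.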

Right.  Got a list of PIDs now.  Conveniently, 599 of 'em.  Now what?

My first thought was to send something relatively benign like SIGCONT to each, check the return value and hopefully harvest some info from each of them at the same time about their command line so I could tell which of the active ones were still the processes I was interested in.  That would look something like this:
for i in /tmp/PRServer_127.0.0.1_*
do
    for j in $(cat $i)
    do
        if [[ kill -SIGCONT $j ]]
        then
            # appears to be an active process, check it
            if [ -n "$(ps aux | grep $j | grep -v grep | grep prserv)" ]
            then
                # active and a PR server, probably the right one.
            else
                # remove the dead pidfile
                rm /tmp/PRServer_127.0.0.1_$j.pid
            fi
        fi
    done
done
Don't use that.  It almost certainly won't work.  I never even typed it into the shell.
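For the record, two things stop the loop above from even parsing in bash: [[ ... ]] evaluates a conditional expression rather than running kill, and a then branch that holds nothing but a comment is a syntax error. If you did want to go down the signal route, signal 0 is the conventional probe: kill does its existence and permission checks but delivers nothing. A minimal sketch, with pid_alive as a made-up name:

```shell
# Hypothetical helper: succeed if a process with this PID exists and we are
# allowed to signal it. Signal 0 runs the checks without sending anything.
pid_alive() {
    kill -0 "$1" 2>/dev/null
}

# Usage: the shell's own PID is always alive.
if pid_alive "$$"
then
    echo "PID $$ is alive"
fi
```

Two caveats: unless you're root, it reports failure for a live process owned by another user, and it says nothing about whether the PID has since been recycled, so the command line still needs checking.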

I got thinking a bit more about this.  Considered using -p rather than SIGCONT, googled a bit for some tool to help, then thought:  this is dumb, I have the info I need in the shell.  So here's what I ended up with:
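for i in /tmp/PRServer_127.0.0.1_*
do
    for j in $(cat "$i")
    do
        # still alive and still a bitbake process: leave the pidfile alone
        if [ -d "/proc/$j" ]
        then
            if grep -q bitbake "/proc/$j/cmdline"
            then
                continue
            fi
        fi
        # dead, or the PID now belongs to something else
        rm "$i"
    done
done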
So how'd that work out?
% ls -l /tmp/PRServer_127.0.0.1_* | wc -l
9
Still a suspiciously large number, but one I can live with.
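The same check generalises to any pile of pidfiles, not just bitbake's. A rough sketch, assuming a Linux /proc; the function name pid_runs and its arguments are hypothetical. /proc/PID/cmdline separates arguments with NUL bytes, so the sketch converts them to spaces before matching.

```shell
# Hypothetical helper: succeed if PID $1 is alive and its command line
# contains the fixed string $2. Linux-only, since it depends on /proc.
pid_runs() {
    [ -r "/proc/$1/cmdline" ] || return 1
    # arguments in cmdline are NUL-separated; turn them into spaces
    tr '\0' ' ' < "/proc/$1/cmdline" | grep -qF -- "$2"
}
```

Called as pid_runs "$j" bitbake, it could stand in for the two nested ifs in the loop above.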
Turns out, though, there's another (in this case better) way to do it.  The pidfile's name includes the port number the server is listening on.  If there's nothing still listening on that port, it's safe to assume the server is dead.
for i in /tmp/PRServer_127.0.0.1_*
do
    for j in $(basename "$i" | awk -F_ '{ print $3 }' | cut -f1 -d.)
    do
        # -n keeps ports numeric; the trailing space stops port 80 matching 8080
        if netstat -ltn | grep -q "127.0.0.1:$j "
        then
            continue
        else
            rm "$i"
        fi
    done
done
This relies on behaviour of the PR Service, though, whereas my first attempt works for any ridiculously large list of processes.  I think I like it better, though, because I like anything that lets me bring awk into the equation.
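For completeness, and with apologies to awk, the port can also be pulled out of the file name with plain parameter expansion. This is only a sketch: port_from_pidfile is an invented name, and it assumes the PRServer_<address>_<port>.pid naming that the loop above relies on.

```shell
# Hypothetical helper: print the port encoded in a PR Server pidfile name,
# assuming names of the form PRServer_<address>_<port>.pid.
port_from_pidfile() {
    name=${1##*/}       # drop the directory part
    name=${name%.pid}   # drop the .pid suffix
    printf '%s\n' "${name##*_}"   # keep what follows the last underscore
}
```

It saves three processes per file, which hardly matters for a few hundred pidfiles.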
