Tuesday, March 12, 2013

find sucks

There are certainly times when find is useful, but nearly every time I end up doing something like this:
# find . -type f -name foo
That's my age showing, probably. Regardless, one of the old 2¢ Tips I remember from back in the day that I hardly ever used until recently was using locatedb to find stuff on your disk.

Yes, there are absolutely a lot of options in "modern" distributions for finding files, often going as far as indexing contents and letting you search a single interface to find them all. That's great for a lot of people, I'm sure. But I'm a software developer most of the time, meaning I would rather have my machine sit idle most of the time than be busy doing something else when I want it to be compiling. Similarly, I need my disk space because I'm going to have a lot of intermediate files. They don't live long, sure, but if I run out of disk space in the middle of a build, that's kinda bad. So I don't want all-singing-all-dancing indexers running in the background (or worse still, on boot) and leaving databases lying around full of stuff I'll never be searching for anyway.

But back to the point: the old tip that I've only recently started adapting for my own purposes.  The essence of it is this:
Create a custom locatedb containing only the stuff you want to search, update it on demand, then search that.
That's it.  So the simplest possible implementation of the idea is this:
# cd $HOME
# updatedb -o $HOME/.homedir.db -l 0 -U .
That's it.  After that, all subsequent searches are of the form:
# locate -d $HOME/.homedir.db --regex ".*foo$"
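If you'd rather not retype those two commands, they can be wrapped in a few shell functions that build the database on first use and refresh it only when asked. This is a minimal sketch, assuming mlocate's updatedb and locate are on your PATH; the function names (refresh_homedb, search_homedb, search_ext) and the HOMEDIR_DB variable are made up for illustration.

```shell
# Hypothetical helpers; assumes mlocate's updatedb and locate are on PATH.
HOMEDIR_DB="${HOMEDIR_DB:-$HOME/.homedir.db}"

# Rebuild the private database on demand.
#   -o    write the database to this file instead of the system default
#   -l 0  don't require visibility checks at search time (it's a private db)
#   -U    index only this tree
refresh_homedb() {
    updatedb -o "$HOMEDIR_DB" -l 0 -U "$HOME"
}

# Search the private database, building it first if it doesn't exist yet.
search_homedb() {
    [ -f "$HOMEDIR_DB" ] || refresh_homedb || return 1
    locate -d "$HOMEDIR_DB" --regex "$1"
}

# Search by file extension. The dot is escaped so that "mp3" matches
# "song.mp3" but not "foomp3", which mirrors find's -name "*.mp3".
search_ext() {
    search_homedb "\\.$1\$"
}
```

With these sourced into your shell, "search_ext mp3" does the lookup and "refresh_homedb" re-indexes whenever the tree has changed enough to matter. Note that a pattern like ".*mp3$" has no literal dot in it, so it is slightly looser than find's "*.mp3".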
Simpler?  Maybe not.  Faster?  Actually, not greatly on a local disk with a small hierarchy.  Consider that by small I mean something like this: 
# locate -S -d $HOME/.homedir.db
Database /home/joe/.homedir.db:
        24,135 directories
        196,394 files
        16,616,477 bytes in file names
        6,598,887 bytes used to store database
For example, my search for all mp3s in my home directory results in this:
skynet ~ time locate -d $HOME/.homedir.db --regex ".*mp3$"
[output deleted]
    1.83s real     1.73s user     0.00s system

skynet ~ time find . -type f -name  "*.mp3"

[output deleted]

    0.55s real     0.23s user     0.28s system
Not really a win.  But let's look at something more "real world".  In my case this is a project source tree.  Any mp3s in there?
turd src time locate -d ./.src-list.db --regex ".*mp3$"
    0.65s real     0.65s user     0.00s system
turd src time find . -type f -name  "*.mp3"
   83.28s real     1.40s user     4.15s system
Nope.  Didn't expect any, but there's the speed-up we were looking for.  FWIW, the database is roughly twice the size of my home directory one:

turd src locate -S -d ./.src-list.db 
Database ./.src-list.db:
        40,534 directories
        412,043 files
        51,894,579 bytes in file names
        11,188,602 bytes used to store database
turd src time locate -d ./.src-list.db --regex ".*layer.conf$"
[output deleted]
    0.56s real     0.56s user     0.00s system
turd src time find . -type f -name "layer.conf"
[output deleted]
    1.25s real     0.60s user     0.65s system

So there, again, we see a speed-up: better than 100x for the mp3 search and about 2x for layer.conf.  If you're unfortunate enough to be doing anything over shared filesystems, you're going to see improvements in the range of a couple of orders of magnitude.

Beware:  I'd been using this trick for a while before I realized just how fragmented the *locate world has become.  What I have here works for mlocate but not necessarily for any other version.
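Since the flags differ between implementations, it may be worth checking which locate you actually have before relying on any of this. Here's a rough sketch: the helper names are hypothetical, and the banner substrings it looks for (mlocate, plocate, findutils) are my assumption about what common builds print for --version, so treat the result as a hint rather than a guarantee.

```shell
# Hypothetical helper: guess the locate flavor from its version banner.
# The substrings matched below are assumptions based on common builds.
classify_locate() {
    case "$1" in
        *plocate*)   echo plocate ;;
        *mlocate*)   echo mlocate ;;
        *findutils*) echo gnu ;;
        *)           echo unknown ;;
    esac
}

# Ask the installed locate for its banner; implementations that don't
# understand --version (or a missing binary) end up as "unknown".
locate_flavor() {
    classify_locate "$(locate --version 2>/dev/null | head -n 1)"
}
```

Something like [ "$(locate_flavor)" = mlocate ] at the top of a script is then enough to bail out early on a machine where the options above won't work.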

Also, note that this isn't quite the same as the old tip that first enlightened me about locate.  That was for indexing CDs (or, more likely, floppies) where you wanted to be able to search a library of them without manually popping each into the drive.  That still works, though I'm hard pressed to think of a practical use for it now that we live in the age of the cloud.  But for the sake of conversation and completeness, here's how that worked.

  1. Mount your CD to a custom location (e.g. /mnt/cd-label-1)
  2. updatedb -o $HOME/cd-database.db -l 0 -U /mnt/cd-label-1
  3. search at your leisure as above
Now you need only search the database, and the path name will actually tell you which CD your files are on.  Except who backs stuff up to CD anymore in the age of cheap, effectively infinite cloud storage?  Hmm.
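For completeness, the three steps above could be sketched as a pair of shell functions. One wrinkle: re-running updatedb against the same output file replaces its contents, so this sketch keeps one database per disc and hands locate the whole set as a colon-separated list, which mlocate's -d accepts. The CD_CATALOG directory and the function names are hypothetical, and it assumes each disc gets mounted at /mnt/<label>.

```shell
# Hypothetical catalog layout: one database per disc under CD_CATALOG.
CD_CATALOG="${CD_CATALOG:-$HOME/.cd-catalog}"

# Index the disc currently mounted at /mnt/<label>.
catalog_cd() {
    mkdir -p "$CD_CATALOG" || return 1
    updatedb -o "$CD_CATALOG/$1.db" -l 0 -U "/mnt/$1"
}

# Search every catalogued disc at once.
search_cds() {
    dbs=""
    for db in "$CD_CATALOG"/*.db; do
        [ -f "$db" ] || continue        # skip the unexpanded glob when empty
        dbs="${dbs:+$dbs:}$db"          # join database paths with ':'
    done
    [ -n "$dbs" ] || { echo "no discs catalogued" >&2; return 1; }
    locate -d "$dbs" --regex "$1"
}
```

After "catalog_cd cd-label-1" for each disc, "search_cds 'foo$'" searches them all, and since every hit starts with its mount point (/mnt/cd-label-1/...), the path tells you which disc to dig out.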