Wednesday, June 25, 2014

mailman Archives Made Useful

I use mutt for all my email needs.  (Not quite true, but I don't count the javascript-fest that is Gmail's interface as an actual mailer.)  It's extremely good at doing what I want it to do:  presenting text in an easily navigable and reasonably easily searchable format.  Yes, I wish I had something that could search my offline mailboxes as efficiently as Gmail's search, but that's a task for another day.

Anyway, a lot of my day is spent working with mailing list traffic.  It's the way of things for a lot of high-tech folks in the modern age.  It's generally a good thing, too.  Often I find myself signing up for a new mailing list for a specific purpose (last week it was to hopefully find answers about strangeness I'm seeing on a board I'm using, this week it was to submit a patch to a project I've been using but until now not contributing to, next week it'll be something else).  When I do that, though, I have a personal code of conduct spawned from one of the touch-stones of my personal philosophy.  Here it is, one of the closest things I have to actual wisdom:

You are not the first one to see this.

There are a lot of corollaries to this theorem, and I can think of an excellent counter-example from yesterday, but putting in qualifiers and such (e.g. you are almost certainly not the first one to see this) opens the door to interpretation and short-cutting and, frankly, intellectually lazy behaviour.

So that's why when I go to a new mailing list for something, I make a genuine effort to search the archives before ever posting anything.

That doesn't sound like such a big thing, but searching archives can actually be a significant challenge depending on the list.  In particular, if a list has chosen to use GNU mailman as the management software (for reasons that I can only conclude are indicative of inherently anti-social tendencies, because no one, in good conscience, should ever choose it), then the "archives", such as they are, are broken down into a nearly useless set of non-searchable HTML-formatted pages, separated by month and year boundaries.  For what the mailman developers most likely laughingly call convenience, you can also download a gzip-compressed flat text file of each month's archive.  Note that these text files, once uncompressed, are not in any standard mailbox format that can be read by civilized mail tools.

But they're close.

In the interests of being completely honest here, I am aware that many mailman installations are configured to actually provide the entire archive in mbox-format (since that is, apparently, how mailman stores the archive anyway) to subscribers provided they use an undocumented URL.  See here for more detail on that.  That does not always work, however.  In about half of the mailman lists I've run into I've found that this feature (the one redeeming feature of mailman, if I may be so bold) is disabled.  So we're back to downloading compressed-not-quite-mbox-files-split-by-month (because a discussion thread never carries over from one month to the next...) and opening them in your favourite text editor.

Or you do what dozens of people have done and you create a tool to turn those text files into something that's actually usable by mail programs.  Go ahead, search.  You'll find a lot of them.  Shell scripts, python scripts, ruby scripts, perl scripts, probably C programs, likely things written in Haskell and Modula-3 and Emacs-lisp as well if you're of a particularly deviant bent.

This week I tried a half-dozen of these things, all with varying levels of success, all falling below "acceptable".  So I wrote my own.  Here it is:

 wget -l 1 -nd -A .gz -r <archive_url>
 for i in *.gz ; do
   gunzip -c "$i" | sed 's=\(^From.*\) at =\1@=' >> out.mbox
 done

The primary reason I did this is because I'm tired of constantly hacking this together on the command line and futzing about with getting the correct command-line options or the correct sed expression every single time I run into this scenario.  It happens often enough I'm annoyed by it but not often enough that it's become muscle-memory.  So here we are.  I created a script to wrap the core idea, making it a bit friendlier and less likely to leave garbage lying around my disk.
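To make the sed expression in the one-liner less mysterious, here is a minimal sketch of what it does to a few lines of archive text.  The sample lines and the helper name `fix_from_lines` are made up for illustration, and the anchor has been moved outside the group, which should behave the same way for lines like these.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): undo mailman's
# " at " obfuscation on lines that begin with "From", reading from stdin.
fix_from_lines() {
  # Use '=' as the s-command delimiter so the replacement needs no escaping.
  # The greedy .* means the last " at " on a matching line is the one rewritten.
  sed 's=^\(From.*\) at =\1@='
}

# Made-up sample of what an uncompressed monthly archive might contain.
printf '%s\n' \
  'From joe at example.com  Wed Jun 25 10:00:00 2014' \
  'From: joe at example.com (Joe)' \
  'Subject: have a look at this' \
  | fix_from_lines
```

Only the two lines starting with "From" are rewritten; the Subject line passes through untouched.  A body line that happens to start with "From" and contains " at " would be rewritten too, which is a limitation of matching on line prefixes alone.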

The resulting script:

 #!/bin/bash  
 #  
 # Copyright (c) 2014, Joe MacDonald <joe@deserted.net>  
 # All rights reserved.  
 #   
 # Redistribution and use in source and binary forms, with or without  
 # modification, are permitted provided that the following conditions are met:  
 #   * Redistributions of source code must retain the above copyright  
 #     notice, this list of conditions and the following disclaimer.  
 #   * Redistributions in binary form must reproduce the above copyright  
 #     notice, this list of conditions and the following disclaimer in the  
 #     documentation and/or other materials provided with the distribution.  
 #   * Neither the name of Joe MacDonald nor the names of its contributors may  
 #     be used to endorse or promote products derived from this software  
 #     without specific prior written permission.  
 #   
 # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND  
 # ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED  
 # WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE  
 # DISCLAIMED. IN NO EVENT SHALL <COPYRIGHT HOLDER> BE LIABLE FOR ANY  
 # DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES  
 # (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;  
 # LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND  
 # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT  
 # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS  
 # SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
 #   
   
 function dump_help  
 {  
   echo "$0 [-w <wget>] [-m <mailbox>] -u <url> [-a]"  
   echo ""  
   echo "  -w <wget>   Full path to your wget binary"  
   echo "  -m <mailbox>  The filename for the newly created archive mailbox"  
   echo "         Default: /tmp/archive.mbox"  
   echo "  -u <url>    The URL to the mailman archive index page."  
   echo "  -a       Append to an existing mbox."  
   echo ""  
 }  
   
 ARCHIVE_DL_DIR=`mktemp -d -q`  
   
 while getopts "am:u:w:" option  
 do  
   case $option in  
    a ) APPEND="y" ;;  
    m ) MAILBOX="$OPTARG" ;;  
    u ) URL="$OPTARG" ;;  
    w ) WGET="$OPTARG" ;;  
    * ) dump_help ; exit ;;  
   esac  
 done  
   
 # ------------------------------------------------------------------------  
 # if you did not provide a URL, we cannot proceed.
 # {{{  
 if [ -z "${URL}" ]  
 then  
   echo You must specify the URL of the mailman archive index page.  
   exit 1
 fi  
 # }}}  
   
 # ------------------------------------------------------------------------  
 # if there's no wget, we cannot proceed.  
 # {{{  
 if [ -z "${WGET}" ]  
 then  
   WGET=`which wget`  
 fi  
 if [ ! -x "${WGET}" ]
 then  
   echo Wget not found, unable to proceed. If you have installed wget  
   echo in a location not in your PATH, you can try passing the -w option  
   echo to tell the script where to find wget.  
   exit 1
 fi  
   
 WGET_OPTS=" \  
   -l 1 \  
   -r \  
   --no-directories \  
   -A .gz \  
   "  
 # }}}  
   
 # ------------------------------------------------------------------------  
 # if we cannot create the new mailbox, we cannot proceed either.  
 # {{{  
 if [ -e ${MAILBOX:=/tmp/archive.mbox} ]  
 then  
   if [ -z "${APPEND}" ]  
   then  
    echo "WARNING: The specified mailbox (${MAILBOX}) already"  
    echo "     exists. This script will append the new mbox contents"  
    echo "     to it. If this is what you want, specify the -a (append)"  
    echo "     option on the command line and re-run this script."
    exit 1
   fi  
 fi  
 touch ${MAILBOX}  
 if [ ! -w "${MAILBOX}" ]  
 then  
   echo Unable to write to mailbox file: \"${MAILBOX}\".  
   exit 1
 else  
   # Since we're changing directories, we'll want to canonicalize this,  
   # otherwise the above check is invalid unless you explicitly set a full path
   # to your new mailbox *or* you took the default.  
   MAILBOX=$(readlink -e ${MAILBOX})  
 fi  
 # }}}  
   
 # ------------------------------------------------------------------------  
 # main  
 # {{{  
 if [ -d ${ARCHIVE_DL_DIR} ]  
 then  
   echo Temporary archive download directory: ${ARCHIVE_DL_DIR}  
   pushd ${ARCHIVE_DL_DIR}  
   "${WGET}" ${WGET_OPTS} "${URL}"
   for i in *.gz  
   do  
    gunzip -c "$i" | sed 's=\(^\(From\|Cc\|To\).*\) at =\1@=' >> "${MAILBOX}"
   done  
   popd  
   rm -fr ${ARCHIVE_DL_DIR}  
 else  
   echo Failed to create temporary archive download directory  
   echo Unable to continue. Consider setting TMPDIR to a writable  
   echo location in your environment then re-run the command.  
 fi  
 # }}}
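
One thing the script above does not do is clean up after itself on the early `exit` paths: the temporary directory is created before the options are checked, so a missing URL or an unwritable mailbox leaves an empty directory behind.  A minimal sketch of one way to tidy that up, assuming bash; the helper name `run_in_scratch_dir` is hypothetical and not part of the script:

```shell
#!/bin/bash
# Hypothetical helper: run a command inside a throwaway directory that is
# removed when the subshell exits, whether the command succeeds or fails.
run_in_scratch_dir() {
  (
    scratch=$(mktemp -d) || exit 1
    # The EXIT trap fires when this subshell ends, on any exit path.
    trap 'rm -rf "$scratch"' EXIT
    cd "$scratch" || exit 1
    "$@"
  )
}

# Prints the scratch directory, which is gone again by the time this returns.
run_in_scratch_dir pwd

# A failing command still triggers the cleanup.
run_in_scratch_dir false || echo "command failed, scratch directory removed anyway"
```

In the script itself, the same idea would probably amount to setting a trap right after the `mktemp -d` call and dropping the explicit `rm -fr` at the end.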