Anyway, a lot of my day is spent working with mailing list traffic. It's the way of things for a lot of high-tech folks in the modern age. It's generally a good thing, too. Often I find myself signing up for a new mailing list for a specific purpose (last week it was to hopefully find answers about strangeness I'm seeing on a board I'm using, this week it was to submit a patch to a project I've been using but until now not contributing to, next week it'll be something else). When I do that, though, I have a personal code of conduct spawned from one of the touch-stones of my personal philosophy. Here it is, one of the closest things I have to actual wisdom:
You are not the first one to see this.
There are a lot of corollaries to this theorem and I can think of an excellent counter-example from yesterday, but putting in qualifiers and such (e.g. you are almost certainly not the first one to see this) opens the door to interpretation and short-cutting and, frankly, intellectually lazy behaviour.
So that's why when I go to a new mailing list for something, I make a genuine effort to search the archives before ever posting anything.
That doesn't sound like such a big thing, but searching archives can actually be a significant challenge depending on the list. In particular, if a list has chosen to use GNU mailman (for reasons that I can only conclude are indicative of inherently anti-social tendencies, because no one, in good conscience, should ever choose it) as the management software, then the "archives", such as they are, are broken down into a nearly useless set of non-searchable HTML-formatted pages, separated by month and year boundaries. For what the mailman developers most likely laughingly call convenience, you can also download a gzip-compressed flat text file of each month's archive. Note that these text files, once uncompressed, are not in any standard mailbox format that can be read by civilized mail tools.
But they're close.
In the interests of being completely honest here, I am aware that many mailman installations are configured to actually provide the entire archive in mbox-format (since that is, apparently, how mailman stores the archive anyway) to subscribers provided they use an undocumented URL. See here for more detail on that. That does not always work, however. In about half of the mailman lists I've run into I've found that this feature (the one redeeming feature of mailman, if I may be so bold) is disabled. So we're back to downloading compressed-not-quite-mbox-files-split-by-month (because a discussion thread never carries over from one month to the next...) and opening them in your favourite text editor.
Or you do what dozens of people have done and you create a tool to turn those text files into something that's actually usable by mail programs. Go ahead, search. You'll find a lot of them. Shell scripts, python scripts, ruby scripts, perl scripts, probably C programs, likely things written in Haskell and Modula-3 and Emacs-lisp as well if you're of a particularly deviant bent.
This week I tried a half-dozen of these things, all with varying levels of success, all falling below "acceptable". So I wrote my own. Here it is:
wget -l 1 -A .gz -r <archive_url>
for i in *.gz ; do
    gunzip -c "$i" | sed 's=\(^From.*\) at =\1@=' >> out.mbox
done
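To make the sed expression a little less magical: the monthly text files look almost like mbox, except that addresses on the "From" lines are written as "user at host" instead of "user@host". The sketch below runs the same substitution over a made-up sample message. The `deobfuscate` function name and the sample text are assumptions for illustration only, and the anchor has been moved outside the group, which should behave the same way but is a touch more portable across sed implementations.

```shell
#!/bin/bash
# Hypothetical helper: rewrite "user at host" to "user@host" on any line
# that starts with "From" (both the "From " separator and the "From:" header).
deobfuscate() {
    sed 's/^\(From.*\) at /\1@/'
}

# Made-up sample resembling one message from a monthly text archive.
sample='From jdoe at example.com  Mon Jan  6 10:00:00 2014
From: jdoe at example.com (J. Doe)
Subject: Meet at noon

See you at noon.'

# Only the two lines beginning with "From" are changed; the subject and
# body keep their " at " untouched.
printf '%s\n' "$sample" | deobfuscate
```

Note that a body line which happens to start with "From" and contain " at " would be rewritten as well. For searching an archive that is usually harmless, but it is worth knowing.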
The primary reason I did this is that I'm tired of constantly hacking this together on the command line and futzing about with getting the correct command-line options or the correct sed expression every single time I run into this scenario. It happens often enough that I'm annoyed by it but not often enough that it's become muscle-memory. So here we are. I created a script to wrap the core idea, making it a bit friendlier and less likely to leave garbage lying around my disk.
The resulting script:
#!/bin/bash
#
# Copyright (c) 2014, Joe MacDonald <joe@deserted.net>
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of Joe MacDonald nor the names of its contributors may
# be used to endorse or promote products derived from this software
# without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL <COPYRIGHT HOLDER> BE LIABLE FOR ANY
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
function dump_help
{
    echo "$0 [-w <wget>] [-m <mailbox>] -u <url> [-a]"
    echo ""
    echo "  -w <wget>     Full path to your wget binary"
    echo "  -m <mailbox>  The filename for the newly created archive mailbox"
    echo "                Default: /tmp/archive.mbox"
    echo "  -u <url>      The URL to the mailman archive index page."
    echo "  -a            Append to an existing mbox."
    echo ""
}
ARCHIVE_DL_DIR=`mktemp -d -q`
while getopts "am:u:w:" option
do
    case $option in
        a ) APPEND="y" ;;
        m ) MAILBOX="$OPTARG" ;;
        u ) URL="$OPTARG" ;;
        w ) WGET="$OPTARG" ;;
        * ) dump_help ; exit ;;
    esac
done
# ------------------------------------------------------------------------
# if you did not provide a URL, we cannot proceed.
# {{{
if [ -z "${URL}" ]
then
    echo "You must specify the URL of the mailman archive index page."
    exit 1
fi
# }}}
# ------------------------------------------------------------------------
# if there's no wget, we cannot proceed.
# {{{
if [ -z "${WGET}" ]
then
    WGET=`which wget`
fi
# Quote ${WGET}: if it is empty, an unquoted test collapses to "[ ! -x ]",
# which is false, and the missing wget would go unnoticed.
if [ ! -x "${WGET}" ]
then
    echo "Wget not found, unable to proceed. If you have installed wget"
    echo "in a location not in your PATH, you can try passing the -w option"
    echo "to tell the script where to find wget."
    exit 1
fi
WGET_OPTS=" \
    -l 1 \
    -r \
    --no-directories \
    -A .gz \
"
# }}}
# ------------------------------------------------------------------------
# if we cannot create the new mailbox, we cannot proceed either.
# {{{
if [ -e "${MAILBOX:=/tmp/archive.mbox}" ]
then
    if [ -z "${APPEND}" ]
    then
        echo "WARNING: The specified mailbox (${MAILBOX}) already"
        echo "  exists. This script will append the new mbox contents"
        echo "  to it. If this is what you want, specify the -a (append)"
        echo "  option on the command line and re-run this script."
        exit 1
    fi
fi
touch "${MAILBOX}"
if [ ! -w "${MAILBOX}" ]
then
    echo "Unable to write to mailbox file: \"${MAILBOX}\"."
    exit 1
else
    # Since we're changing directories, we'll want to canonicalize this,
    # otherwise the above check is invalid unless you explicitly set a full path
    # to your new mailbox *or* you took the default.
    MAILBOX=$(readlink -e "${MAILBOX}")
fi
# }}}
# ------------------------------------------------------------------------
# main
# {{{
if [ -d "${ARCHIVE_DL_DIR}" ]
then
    echo "Temporary archive download directory: ${ARCHIVE_DL_DIR}"
    pushd "${ARCHIVE_DL_DIR}"
    # Use the wget located above (or passed with -w), not whatever is in PATH.
    # WGET_OPTS is left unquoted on purpose so it splits into separate options.
    "${WGET}" ${WGET_OPTS} "${URL}"
    for i in *.gz
    do
        gunzip -c "$i" | sed 's=\(^\(From\|Cc\|To\).*\) at =\1@=' >> "${MAILBOX}"
    done
    popd
    rm -fr "${ARCHIVE_DL_DIR}"
else
    echo "Failed to create temporary archive download directory"
    echo "Unable to continue. Consider setting TMPDIR to a writable"
    echo "location in your environment then re-run the command."
fi
# }}}
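One caveat on the "less likely to leave garbage lying around" front: the script creates its temporary directory before parsing options and only removes it at the end of the main block, so any of the early exit paths leaves an empty directory behind. A possible way to tighten that up, sketched here and not part of the script above, is to let an EXIT trap do the cleanup. The `run_in_tmpdir` name is a hypothetical helper invented for this example.

```shell
#!/bin/bash
# Hypothetical helper: run a command inside a throwaway directory that is
# removed when the subshell exits, whether the command succeeded or not.
run_in_tmpdir() {
    (
        dir=$(mktemp -d) || exit 1
        # The trap fires on every exit from this subshell, including errors.
        trap 'rm -rf "$dir"' EXIT
        cd "$dir" || exit 1
        "$@"
        status=$?
        exit "$status"
    )
}

# Example: the command sees the scratch directory as its working directory;
# by the time run_in_tmpdir returns, that directory is gone again.
run_in_tmpdir pwd
```

In the script, the download-and-convert loop would be the command handed to such a helper, which also removes the need for the pushd/popd pair since the directory change is confined to the subshell.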