Adding Full Text Search to MPP Email Archives
Like Craigslist and others, MPP uses the open-source indexing tool Sphinx for full-text searching of email archives. Sphinx is supported both with the metadata storage model, where metadata is stored in the database and emails are stored as files, and with full email storage in MySQL.
In the MPP Email Archive Architecture there are four components: Archive Agents, Message Store, Message Processor, and Access Server. The indexing capabilities must be added to the Access Server, also known as MPP Manager or Qreview.
This tutorial walks you through installing and configuring Sphinx and related components. We will use the "main+delta" scheme to index the extracted data incrementally: all data from content_index is indexed once, and after that only new data is indexed every hour.
This guide covers Sphinx 0.9.7 and 0.9.8.1 (the latest version, recommended).
Requirements
1) Configure MPP to archive email using the metadata storage model. In this mode metadata is stored in MySQL and emails are stored on the file system. MySQL Server 5.0 or higher MUST be accessible, the MPP tables must be set up, and access must be granted. See INSTALL.mysql for details on setting up MPP to work with MySQL.
2) Set the MySQL client-server socket location
NOTE: Defining a [client] section with the "socket" option is required when MPP and the MySQL server are on the same machine.
The MySQL functionality bundled with MPP relies on a static MySQL client library. The static library assumes the MySQL server socket is at /tmp/mysqld.sock, but it also looks for a my.cnf file at /etc/my.cnf.
The MySQL client machine (where MPP is installed) therefore needs an /etc/my.cnf with the correct path to the MySQL socket in its [client] section, using the same "socket" value as the [mysqld] section.
Example of /etc/my.cnf configuration:
[mysqld]
socket = /var/run/mysqld/mysqld.sock
[client]
socket = /var/run/mysqld/mysqld.sock
On RedHat/CentOS/Fedora, add a [client] section to /etc/my.cnf with the "socket" option defined exactly as in the [mysqld] section. Commenting out the old_passwords option is recommended too.
On Debian/Gentoo make the following symlink:
ln -s /etc/mysql/my.cnf /etc/my.cnf
3) Set max_allowed_packet
In /etc/my.cnf, add or change max_allowed_packet = 64M in the [mysqld] section.
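A mismatch between the [client] and [mysqld] socket paths from step 2 is easy to miss, so a quick check can help. The following is a minimal sketch and not part of MPP: the function names cnf_socket and sockets_match are hypothetical, and it assumes a simple my.cnf with one "socket = ..." line per section and no included files.

```shell
# Print the "socket" value from one section of a my.cnf-style file.
# Usage: cnf_socket FILE SECTION   (for example: cnf_socket /etc/my.cnf client)
cnf_socket() {
    awk -v want="[$2]" '
        # Remember the current section header, e.g. "[mysqld]"
        /^[ \t]*\[/ { gsub(/[ \t]/, ""); section = $0; next }
        section == want && /^[ \t]*socket[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")   # drop everything up to "="
            sub(/[ \t]+$/, "")         # drop trailing whitespace
            print
            exit
        }' "$1"
}

# Succeed only when [client] and [mysqld] name the same, non-empty socket.
sockets_match() {
    client_sock=$(cnf_socket "$1" client)
    server_sock=$(cnf_socket "$1" mysqld)
    [ -n "$client_sock" ] && [ "$client_sock" = "$server_sock" ]
}
```

For example, "sockets_match /etc/my.cnf || echo 'check the [client] socket'" would flag a file where the two sections disagree or the [client] entry is missing.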
Adding support for .doc and .docx
1) MPP will use antiword to process DOC documents (Word 98-2003) if the "antiword" application exists in PATH. To install Antiword:
wget -c http://www.winfield.demon.nl/linux/antiword-0.37.tar.gz
(OS X users: curl -O http://www.winfield.demon.nl/linux/antiword-0.37.tar.gz)
tar xzvf antiword-0.37.tar.gz
cd antiword-0.37
sudo make -f Makefile.Linux
sudo make -f Makefile.Linux global_install
2) MPP will use docx2txt to process DOCX documents (Word 2007) if the "docx2txt" application exists in PATH.
To install Docx2txt use:
wget -c http://garr.dl.sourceforge.net/sourceforge/docx2txt/docx2txt-0.3.tgz
(OS X users: curl -O http://garr.dl.sourceforge.net/sourceforge/docx2txt/docx2txt-0.3.tgz)
tar xzvf docx2txt-0.3.tgz
cd docx2txt-0.3
sudo make install
Add Support for PDF
MPP will use pdftotext to process PDF documents if the "pdftotext" application exists in PATH. Pdftotext is part of poppler (http://poppler.freedesktop.org/); install the right binaries for your OS.
Mac OS X User: Download pdftotext from here
Processing OpenOffice documents
Processing OpenOffice documents is possible using the OpenOffice::OODoc Perl module.
On RHEL/Fedora/CentOS where the RPMForge repository is in use, it can be installed with:
yum install perl-OODoc
Installing from CPAN is also possible:
perl -MCPAN -e shell
install OpenOffice::OODoc
Note: OS X users should start the CPAN shell with /usr/local/mppbase/bin/perl -MCPAN -e shell
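Since MPP picks up each converter only when it is found in PATH, it may be worth confirming what is actually visible before indexing. This is a minimal sketch: check_tools and the PERL variable are hypothetical names introduced here, and the tool names are the ones used in the sections above.

```shell
# Report which of the given programs are available in PATH.
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
        fi
    done
}

check_tools antiword docx2txt pdftotext

# The Perl module counts as installed if perl can load it.
# OS X users would set PERL=/usr/local/mppbase/bin/perl first.
PERL=${PERL:-perl}
if "$PERL" -MOpenOffice::OODoc -e 1 >/dev/null 2>&1; then
    echo "found:   OpenOffice::OODoc"
else
    echo "missing: OpenOffice::OODoc"
fi
```

Run it as the same user that runs MPP, because PATH can differ between an interactive shell and a service account.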
Sphinx setup
1) Download content_index.sql and fetchdata.pl
wget -c ftp://ftp.raeinternet.com/pub/mpp4/docs/content_index.sql
wget -c ftp://ftp.raeinternet.com/pub/mpp4/scripts/fetchdata.pl
(change EDITME to the real MySQL password)
Download and Install Sphinx
As root/admin execute the following:
wget -c http://www.sphinxsearch.com/downloads/sphinx-0.9.7.tar.gz
tar xzvf sphinx-0.9.7.tar.gz
cd sphinx-0.9.7
./configure --prefix=/usr/local/sphinx
Debian users will need this MySQL package (the equivalent of MySQL devel) and should pass the MySQL paths to configure:
apt-get install libmysql++-dev
./configure --prefix=/usr/local/sphinx --with-mysql-libs=/usr/lib/ --with-mysql-includes=/usr/include/mysql
Then build and install:
make
make install
2) Create /usr/local/sphinx/etc/sphinx.conf from the provided file:
wget -c ftp://ftp.mailspect.com/pub/mpp4/sql/sphinx.conf
(change EDITME to the real MySQL password)
mv sphinx.conf /usr/local/sphinx/etc/sphinx.conf
NOTE: For Sphinx 0.9.8.1 download ftp://ftp.mailspect.com/pub/mpp4/sql/sphinx-0.9.8.1.conf
3) Create the content_index and content_counter tables if they don't exist in the MPP archive DB:
mysql -uroot -p mpp_db < content_index.sql
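The steps above ask you to change EDITME to the real MySQL password in the downloaded files. Editing by hand works fine; if you prefer to script it, the following is a minimal sketch. The function name replace_editme and the MPP_NEW_PASS variable are hypothetical, and the sketch assumes the placeholder appears literally as EDITME. It uses a literal (non-regex) substitution so that passwords containing characters such as "&", "/", or "\" are written unchanged.

```shell
# Replace every literal EDITME in FILE with PASSWORD.
# Usage: replace_editme FILE PASSWORD
replace_editme() {
    target=$1
    tmp=$(mktemp) || return 1
    # The password is passed through the environment so awk does not
    # interpret backslashes or other special characters in it.
    MPP_NEW_PASS="$2" awk '
        {
            out = ""
            while ((i = index($0, "EDITME")) > 0) {
                out = out substr($0, 1, i - 1) ENVIRON["MPP_NEW_PASS"]
                $0 = substr($0, i + 6)   # skip past the 6-character placeholder
            }
            print out $0
        }' "$target" > "$tmp" && cat "$tmp" > "$target"   # cat keeps the file mode
    status=$?
    rm -f "$tmp"
    return $status
}
```

For example, "replace_editme /usr/local/sphinx/etc/sphinx.conf 'secret'" would update the Sphinx configuration in place; the same call could be applied to fetchdata.pl.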
Configure Fetchdata
In this step we will set up fetchdata to access your database and configure the processing. The first run of fetchdata.pl may take a long time, so plan for it. To reduce the load on the server, fetchdata.pl can be configured to pace how it processes data by setting my $totalMessages = 10000000;. You will need the MIME::Tools Perl module for fetchdata:
/usr/local/mppbase/bin/perl -MCPAN -e shell
install MIME::Tools
exit
Then install and run the script:
cp fetchdata.pl /usr/local/MPP/scripts/
chmod 755 /usr/local/MPP/scripts/fetchdata.pl
/usr/local/MPP/scripts/fetchdata.pl
If you are using Mac OS X:
Replace perl path in fetchdata.pl with '#!/usr/local/mppbase/bin/perl -w'
Modify DB Access and Set Metadata if in use:
# enable use of metadata (1) or not (0)
my $metadata = 1;
# messages to process at once
my $totalMessages = 10000000;
## DBConnection
#my $strDataBase="mpp_archive";
my $strDataBase="mpp";
my $strHostName="localhost";
my $strPort="";
my $strUser="raempp";
my $strPassword="raempp";
OS X users may see these messages when running fetchdata.pl. They are harmless:
bash-3.2# /usr/local/MPP/scripts/fetchdata.pl
Constant subroutine main::SEEK_SET redefined at /usr/local/MPP/scripts/fetchdata.pl line 31
Constant subroutine main::SEEK_END redefined at /usr/local/MPP/scripts/fetchdata.pl line 31
Constant subroutine main::SEEK_CUR redefined at /usr/local/MPP/scripts/fetchdata.pl line 31
Initialize Sphinx Index
The index is created from the output of fetchdata.pl:
/usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf --all
Schedule Index Updates in Cron
1) Now schedule fetchdata in cron:
crontab -e
5 * * * * /usr/local/MPP/scripts/fetchdata.pl >/dev/null 2>&1 </dev/null
2) Add a cronjob to update mppdeltaindex every hour
45 * * * * /usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf mppdeltaindex --rotate >/dev/null 2>&1 </dev/null
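Because fetchdata.pl and the indexer can both run for a long time, an hourly job may still be running when cron starts the next one. One way to avoid overlapping runs is a simple lock wrapper. This is a sketch under assumptions, not something shipped with MPP: the function name run_locked, the lock path, and the exit code 75 for "skipped" are all hypothetical choices.

```shell
# Run a command only if no other run holds the lock.
# mkdir either creates the directory or fails atomically, which makes it
# usable as a portable lock on both Linux and OS X.
# Usage: run_locked LOCKDIR COMMAND [ARGS...]
run_locked() {
    lockdir=$1
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "previous run still active, skipping" >&2
        return 75   # arbitrary non-zero code meaning "skipped"
    fi
    "$@"
    status=$?
    rmdir "$lockdir"
    return $status
}
```

Saved in a small script that ends by calling run_locked "$@", the cron entry for the delta index would then invoke that script with a lock path such as /tmp/mpp-delta.lock followed by the indexer command shown above. If a run is killed before it finishes, the lock directory stays behind and has to be removed by hand.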
Start Sphinx Daemon (searchd)
Before running searchd you must build the index. The index is built from the output of fetchdata.pl. Running fetchdata.pl and the indexer can take a long time, so take this into consideration.
1) Create both Sphinx indexes, mppindex and mppdeltaindex:
/usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf mppindex mppdeltaindex
2) Start Sphinx Daemon
/usr/local/sphinx/bin/searchd --config /usr/local/sphinx/etc/sphinx.conf
We suggest adding the previous line to your local startup script:
RH/Fedora/CentOS: /etc/rc.local
Debian: /etc/rc.local
Slackware: /etc/rc.d/rc.local
Gentoo: /etc/conf.d/local.start
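If you script the installation, the searchd line should be added to the startup file only once, even when the script is run again. A minimal sketch follows; add_startup_line is a hypothetical helper name, and the file path is whichever startup script from the list above applies to your system.

```shell
# Append LINE to FILE unless an identical line is already present.
# Usage: add_startup_line FILE LINE
add_startup_line() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
    grep -qxF -e "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

For example: add_startup_line /etc/rc.local '/usr/local/sphinx/bin/searchd --config /usr/local/sphinx/etc/sphinx.conf'. Note that some systems ship an rc.local that ends with "exit 0"; in that case the line has to be placed before the exit rather than appended, so check the file after running this.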
Verify Installation
Now we can perform full-text searches from the command line (the tool also prints the messages that matched, for verification):
/usr/local/sphinx/bin/search -c /usr/local/sphinx/etc/sphinx.conf ovidiu
More info can be found here: http://www.sphinxsearch.com/doc.html
NOTE: Once per month, temporarily disable the indexer cronjob and rebuild the main index:
/usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf --all --rotate
Rebuilding Full Text Index
Please see Rebuilding Sphinx Index for instructions on rebuilding the index.
Set in MPP Manager
In MPP Manager at http://yourhost:20000, under Setup -> Module Config, set the following:
sphinx_dir /usr/local/sphinx
Notes
If you have multiple front-ends to your email archive this process must be replicated on each server.
All of this is pre-configured in the MPP Virtual Appliance.
