Mailspect Documentation
README.sphinx

Adding Full Text Search to MPP Email Archives

Like Craigslist and others, MPP uses the open-source indexing tool Sphinx for full-text searching of email archives. Sphinx is supported both with the metadata storage model, where metadata is stored in the database and emails are stored as files, and with full email storage in MySQL.

The MPP Email Archive Architecture has four components: Archive Agents, Message Store, Message Processor, and Access Server. The indexing capabilities must be added to the Access Server, also known as MPP Manager or Qreview.

This tutorial guides you through installing and configuring Sphinx and related components. We will use the "main+delta" concept to index the extracted data incrementally: all data from content_index is indexed once, and after that only new data is indexed every hour.

This guide covers Sphinx 0.9.7 and 0.9.8.1 (the latest version, recommended).

Requirements

1) Configure MPP to archive email using metadata storage. In this mode metadata is stored in MySQL and emails are stored on the file system. MySQL Server 5.0 or higher MUST be accessible, the MPP tables must be set up, and access must be granted. See INSTALL.mysql for details on setting up MPP to work with MySQL.

2) Set MySQL client - server socket location

NOTE: Defining a [client] section with the "socket" option is required when MPP and the MySQL server are on the same machine.

The MySQL functionality bundled with MPP relies on a static MySQL client library. The static library assumes that the MySQL server socket is at /tmp/mysqld.sock, but it also looks for a configuration file at /etc/my.cnf.

The MySQL client machine (where MPP is installed) therefore needs an /etc/my.cnf with the correct path to the MySQL socket in its [client] section: the same "socket" value as in the [mysqld] section.

Example of /etc/my.cnf configuration:

[mysqld]
socket = /var/run/mysqld/mysqld.sock
[client]
socket = /var/run/mysqld/mysqld.sock

On RedHat/CentOS/Fedora, add a [client] section with the "socket" option defined exactly as in the [mysqld] section of /etc/my.cnf. Commenting out the old_passwords option is also recommended.

On Debian/Gentoo make the following symlink:

ln -s /etc/mysql/my.cnf /etc/my.cnf
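
To double-check the socket settings described in this step, the following sketch compares the "socket" value in the [mysqld] and [client] sections of a my.cnf file. The function names cnf_socket and check_sockets are illustrative and not part of MPP or MySQL. The parser is deliberately simple: it assumes plain "socket = path" lines and does not follow include directives.

```shell
# Print the value of "socket" from one section of a my.cnf-style file.
# Usage: cnf_socket FILE SECTION   (e.g. cnf_socket /etc/my.cnf client)
cnf_socket() {
    awk -v want="[$2]" '
        /^[ \t]*\[/ {                       # a section header such as [mysqld]
            line = $0
            gsub(/[ \t]/, "", line)
            in_section = (line == want)
            next
        }
        in_section && /^[ \t]*socket[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")        # drop everything up to the "="
            sub(/[ \t]+$/, "")              # drop trailing whitespace
            print
            exit
        }
    ' "$1"
}

# Succeed only if both sections define the same, non-empty socket path.
check_sockets() {
    server=$(cnf_socket "$1" mysqld)
    client=$(cnf_socket "$1" client)
    [ -n "$server" ] && [ "$server" = "$client" ]
}
```

For example, "check_sockets /etc/my.cnf && echo OK" prints OK only when both sections agree.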


3) Set max_allowed_packet

In /etc/my.cnf, add or change max_allowed_packet = 64M in the [mysqld] section.

Adding support for .doc and .docx

1) MPP will use antiword to process DOC documents (Word 97-2003) if the "antiword" application exists in PATH. To install antiword:

wget -c http://www.winfield.demon.nl/linux/antiword-0.37.tar.gz
OS X users: curl -O http://www.winfield.demon.nl/linux/antiword-0.37.tar.gz
tar xzvf antiword-0.37.tar.gz
cd antiword-0.37
sudo make -f Makefile.Linux
sudo make -f Makefile.Linux global_install


2) MPP will use docx2txt to process DOCX documents (Word 2007) if the "docx2txt" application exists in PATH. To install docx2txt:

wget -c http://garr.dl.sourceforge.net/sourceforge/docx2txt/docx2txt-0.3.tgz
OS X users: curl -O http://garr.dl.sourceforge.net/sourceforge/docx2txt/docx2txt-0.3.tgz
tar xzvf docx2txt-0.3.tgz
cd docx2txt-0.3
sudo make install

Add Support for PDF

MPP will use pdftotext to process PDF documents if the "pdftotext" application exists in PATH. pdftotext is part of poppler (http://poppler.freedesktop.org/); install the right binaries for your OS.

Mac OS X User:  Download pdftotext from here
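
Since MPP only uses these converters when they are found in PATH, a quick check like the one below can confirm what is installed. This is a minimal sketch: the tool names are the ones named in the sections above, while the function names have_tool and report_converters are illustrative. Note that it inspects the PATH of the shell you run it in, which may differ from the PATH MPP itself runs with.

```shell
# Succeed if the named program can be found in PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Print one status line for each optional converter.
report_converters() {
    for tool in antiword docx2txt pdftotext; do
        if have_tool "$tool"; then
            echo "$tool: found at $(command -v "$tool")"
        else
            echo "$tool: not found in PATH"
        fi
    done
}

report_converters
```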

Processing OpenOffice documents

Processing OpenOffice documents is possible using the OpenOffice::OODoc Perl module.

On RHEL/Fedora/CentOS with the RPMForge repository enabled, it can be installed using: yum install perl-OODoc
Installing using CPAN is also possible:
perl -MCPAN -e shell
install OpenOffice::OODoc
Note: OS X users please use  /usr/local/mppbase/bin/perl -MCPAN -e shell

Sphinx setup

1) Download content_index.sql and fetchdata.pl

wget -c ftp://ftp.raeinternet.com/pub/mpp4/docs/content_index.sql
wget -c ftp://ftp.raeinternet.com/pub/mpp4/scripts/fetchdata.pl

In fetchdata.pl, change EDITME to the real MySQL password.
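
The EDITME placeholder can be changed in any text editor. If you prefer to script it, the sketch below shows one possible way using sed. The function name replace_editme is illustrative and not part of MPP; it assumes the placeholder appears literally as EDITME and that the password contains no newline.

```shell
# Replace every EDITME placeholder in FILE with PASSWORD.
# Usage: replace_editme FILE PASSWORD
replace_editme() {
    file=$1
    password=$2
    # Escape the characters that are special in a sed replacement
    # (backslash, ampersand) and the "|" delimiter used below.
    escaped=$(printf '%s' "$password" | sed 's/[\\&|]/\\&/g')
    tmp=$(mktemp) || return 1
    if sed "s|EDITME|$escaped|g" "$file" > "$tmp"; then
        cat "$tmp" > "$file"    # overwrite in place, keeping the file's permissions
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}
```

A call would look like: replace_editme fetchdata.pl 'your-password'. Keep in mind that a password passed on the command line may end up in shell history, and that characters such as $ and @ have special meaning inside a double-quoted Perl string, so editing by hand may be safer for such passwords.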

Download and Install Sphinx

As root/admin execute the following:

wget -c http://www.sphinxsearch.com/downloads/sphinx-0.9.7.tar.gz
tar xzvf sphinx-0.9.7.tar.gz
cd sphinx-0.9.7
./configure --prefix=/usr/local/sphinx

Debian users: first install this MySQL package (the equivalent of MySQL devel):
apt-get install libmysql++-dev
then run configure with the MySQL paths instead:
./configure --prefix=/usr/local/sphinx --with-mysql-libs=/usr/lib/ --with-mysql-includes=/usr/include/mysql

On all systems, finish with:
make
make install

2) Create /usr/local/sphinx/etc/sphinx.conf from the following file

wget -c ftp://ftp.mailspect.com/pub/mpp4/sql/sphinx.conf
mv sphinx.conf /usr/local/sphinx/etc/sphinx.conf

In sphinx.conf, change EDITME to the real MySQL password.

NOTE: For Sphinx 0.9.8.1 download ftp://ftp.mailspect.com/pub/mpp4/sql/sphinx-0.9.8.1.conf


3) Create the content_index and content_counter tables if they do not already exist in the MPP archive DB

mysql -uroot -p mpp_db < content_index.sql

Configure Fetchdata

In this step we will set up fetchdata.pl to access your database and configure its processing. Be aware that the first run of fetchdata.pl, which makes the initial pass over the archive, may take a long time. To reduce the load on the server, fetchdata.pl can be configured to pace how it processes data by changing the value in my $totalMessages = 10000000;. You will need the MIME::Tools Perl module for fetchdata.pl.


/usr/local/mppbase/bin/perl -MCPAN -e shell 
install MIME::Tools
exit


cp fetchdata.pl  /usr/local/MPP/scripts/
chmod 755 /usr/local/MPP/scripts/fetchdata.pl
/usr/local/MPP/scripts/fetchdata.pl

If you are using Mac OS X:

Replace perl path in fetchdata.pl with  '#!/usr/local/mppbase/bin/perl -w'

Modify DB Access and Set Metadata if in use:

# enable use of metadata (1) or not (0)
my $metadata = 1;
# messages to process at once
## DBConnection
# my $strDataBase="mpp_archive";
my $strDataBase="mpp";
my $strHostName="localhost";
my $strPort="";
my $strUser="raempp";
my $strPassword="raempp";


OS X users may see these messages when running fetchdata.pl. They are harmless:
bash-3.2# /usr/local/MPP/scripts/fetchdata.pl
Constant subroutine main::SEEK_SET redefined at /usr/local/MPP/scripts/fetchdata.pl line 31
Constant subroutine main::SEEK_END redefined at /usr/local/MPP/scripts/fetchdata.pl line 31
Constant subroutine main::SEEK_CUR redefined at /usr/local/MPP/scripts/fetchdata.pl line 31

Initialize Sphinx Index

The index is created from the output of fetchdata.pl:

/usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf --all

Schedule Index Updates in Cron

1) Now schedule fetchdata in cron:

crontab -e
5 * * * * /usr/local/MPP/scripts/fetchdata.pl >/dev/null 2>&1 </dev/null

2) Add a cronjob to update mppdeltaindex every hour

45 * * * * /usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf mppdeltaindex --rotate >/dev/null 2>&1 </dev/null
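
The two cron entries above rely on fetchdata.pl finishing within the 40 minutes before the indexer starts. As an alternative sketch, a small wrapper can run both steps in sequence so that the delta indexer starts only after fetchdata.pl has finished successfully. The function name update_delta and the environment-variable overrides are illustrative and not part of MPP or Sphinx; the paths and the index name are the ones used in this guide. The sketch also assumes that fetchdata.pl exits with a non-zero status on failure.

```shell
#!/bin/sh
# Paths used in this guide; each can be overridden from the environment.
: "${FETCHDATA:=/usr/local/MPP/scripts/fetchdata.pl}"
: "${INDEXER:=/usr/local/sphinx/bin/indexer}"
: "${SPHINX_CONF:=/usr/local/sphinx/etc/sphinx.conf}"

# Run fetchdata.pl, then re-index mppdeltaindex only if fetchdata.pl succeeded.
update_delta() {
    "$FETCHDATA" >/dev/null 2>&1 </dev/null || return 1
    "$INDEXER" --config "$SPHINX_CONF" mppdeltaindex --rotate >/dev/null 2>&1 </dev/null
}
```

Saved as a script that ends with a call to update_delta, this could replace both crontab lines with a single hourly entry.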

Start Sphinx Daemon (searchd)

Before running searchd you must build the index. The index is built from the output of fetchdata.pl. Running fetchdata.pl and the indexer can take a long time, so plan accordingly.

1) Create both Sphinx indexes: mppindex and mppdeltaindex

/usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf --all

2) Start Sphinx Daemon

/usr/local/sphinx/bin/searchd --config /usr/local/sphinx/etc/sphinx.conf

We suggest adding the previous line to your local startup script:

RH/Fedora/CentOS: /etc/rc.local
Debian: /etc/rc.local
Slackware: /etc/rc.d/rc.local
Gentoo: /etc/conf.d/local.start
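
When adding the searchd line to a startup script, it may be worth guarding it so that a missing binary or config file does not produce an error at boot. The following is a minimal sketch of such a guard; the function name start_searchd and the SPHINX_DIR variable are illustrative, and the default directory is the install prefix used in this guide.

```shell
# Install prefix used in this guide; can be overridden from the environment.
SPHINX_DIR=${SPHINX_DIR:-/usr/local/sphinx}

# Start searchd only when both the binary and the config file are present.
start_searchd() {
    if [ -x "$SPHINX_DIR/bin/searchd" ] && [ -r "$SPHINX_DIR/etc/sphinx.conf" ]; then
        "$SPHINX_DIR/bin/searchd" --config "$SPHINX_DIR/etc/sphinx.conf"
    else
        echo "searchd or sphinx.conf not found under $SPHINX_DIR; skipping" >&2
    fi
}
```

The startup script would then call start_searchd instead of invoking searchd directly.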

Verify Installation

Now we can perform full-text searches from the command line (the tool also prints the messages that matched, for verification):

/usr/local/sphinx/bin/search -c /usr/local/sphinx/etc/sphinx.conf ovidiu 

More info can be found here: http://www.sphinxsearch.com/doc.html

NOTE: Once per month, temporarily disable the indexer cronjob and rebuild the main index:
/usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf --all --rotate

Rebuilding Full Text Index

Please see Rebuilding Sphinx Index for instructions for rebuilding the index.

Set in MPP Manager

In http://yourhost:20000, under Setup -> Module Config, set the following:

sphinx_dir  	/usr/local/sphinx

Notes

If you have multiple front ends to your email archive, this process must be repeated on each server.

All of this is pre-configured in the MPP Virtual Appliance.