Hi Mark,

See
http://digitalpebble.blogspot.com/2011/05/processing-enron-dataset-using-behemoth.html
for comments on processing the Enron corpus with Tika. Some of the errors
that you are seeing are probably described there.

Julien

On 7 September 2011 02:29, Mark Kerzner <[email protected]> wrote:

> Hi,
>
> as part of testing my FreeEed <http://freeeed.org/> open source eDiscovery
> engine, I am processing the 153 Enron PSTs found here
> <http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2>.
>
> Naturally, I see a lot of errors and warnings. For example, I started with
> the error described here <https://issues.apache.org/jira/browse/PDFBOX-1008>.
> For that, I replaced the version of PDFBox from 1.5.0 to 1.6.0, since I am
> building with maven from the latest svn checkout anyway.
>
> However, for the future, my question is: is there a more systematic way to
> approach this? Is anybody interested in the results of all the testing that
> I am doing, and if yes, how should I report my findings?
>
> Thank you,
> Mark
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
