Hi Mark

See
http://digitalpebble.blogspot.com/2011/05/processing-enron-dataset-using-behemoth.html
for comments on processing the Enron corpus with Tika. Some of the errors
that you are seeing are probably described there.

Julien

On 7 September 2011 02:29, Mark Kerzner <[email protected]> wrote:

> Hi,
>
> as part of testing my FreeEed <http://freeeed.org/> open source eDiscovery
> engine, I am processing the 153 Enron PSTs found here
> <http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2>.
>
> Naturally, I see a lot of errors and warnings. For example, I started with the
> error described here <https://issues.apache.org/jira/browse/PDFBOX-1008>.
> For that, I upgraded PDFBox from 1.5.0 to 1.6.0, since I am
> building with Maven from the latest svn checkout anyway.
>
> However, for the future, my question is: is there a more systematic way to
> approach this? Is anybody interested in the results of all the testing that
> I am doing, and if yes, how should I report my findings?
>
> Thank you,
> Mark
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
