From a corpus analysis point of view, who owns this data, and how do we know it is the real thing?

I don't see any validation data provided for the Enron Email Dataset (http://www.cs.cmu.edu/~enron/).

lbrtchx

On 9/15/11, Mark Kerzner <[email protected]> wrote:
> Mike,
>
> I certainly will do it. I am refactoring the code before I run those tests
> again.
>
> Sincerely,
> Mark
>
> On Thu, Sep 15, 2011 at 5:26 AM, Michael McCandless <[email protected]> wrote:
>
>> That summary is nice, but, can you provide specifics on which docs
>> caused problems for Tika?
>>
>> Ie, if a certain doc hits an exception, we should open a Jira issue
>> and get it fixed...
>>
>> Thanks,
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Sep 8, 2011 at 8:43 AM, Mark Kerzner <[email protected]>
>> wrote:
>> > The processing is complete, the summary found here.
>> > Mark
>> >
>> > On Wed, Sep 7, 2011 at 7:29 AM, Michael McCandless
>> > <[email protected]> wrote:
>> >>
>> >> On Tue, Sep 6, 2011 at 9:29 PM, Mark Kerzner <[email protected]>
>> >> wrote:
>> >>
>> >> > Is anybody interested in the results of all the testing that
>> >> > I am doing, and if yes, how should I report my findings?
>> >>
>> >> I'm interested! This sounds great....
>> >>
>> >> Tika should strive to have no errors on any valid documents... so if
>> >> you (or anyone) are hitting bugs in Tika/POI/PDFBox/etc., let's
>> >> characterize them, open issues, and get them fixed :)
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
