I got it from this site, http://www.edrm.net/resources/data-sets, where it is much more complete. You can check there.

On Sat, Sep 17, 2011 at 2:08 AM, Albretch Mueller <[email protected]> wrote:

> from a corpus analysis point of view, who owns this data?, how do we
> know it is the real thing?
> ~
> I don't see any validation data by Enron Email Dataset
> (http://www.cs.cmu.edu/~enron/)
> ~
> lbrtchx
>
> On 9/15/11, Mark Kerzner <[email protected]> wrote:
> > Mike,
> >
> > I certainly will do it. I am refactoring the code before I run those
> > tests again.
> >
> > Sincerely,
> > Mark
> >
> > On Thu, Sep 15, 2011 at 5:26 AM, Michael McCandless
> > <[email protected]> wrote:
> >
> >> That summary is nice, but, can you provide specifics on which docs
> >> caused problems for Tika?
> >>
> >> Ie, if a certain doc hits an exception, we should open a Jira issue
> >> and get it fixed...
> >>
> >> Thanks,
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >> On Thu, Sep 8, 2011 at 8:43 AM, Mark Kerzner <[email protected]>
> >> wrote:
> >> > The processing is complete, the summary found here.
> >> > Mark
> >> >
> >> > On Wed, Sep 7, 2011 at 7:29 AM, Michael McCandless
> >> > <[email protected]> wrote:
> >> >>
> >> >> On Tue, Sep 6, 2011 at 9:29 PM, Mark Kerzner <[email protected]>
> >> >> wrote:
> >> >>
> >> >> > Is anybody interested in the results of all the testing that
> >> >> > I am doing, and if yes, how should I report my findings?
> >> >>
> >> >> I'm interested! This sounds great....
> >> >>
> >> >> Tika should strive to have no errors on any valid documents... so if
> >> >> you (or anyone) are hitting bugs in Tika/POI/PDFBox/etc., let's
> >> >> characterize them, open issues, and get them fixed :)
> >> >>
> >> >> Mike McCandless
> >> >>
> >> >> http://blog.mikemccandless.com
