I get it from this site, http://www.edrm.net/resources/data-sets, where it
is much more complete. You can check there.

On Sat, Sep 17, 2011 at 2:08 AM, Albretch Mueller <[email protected]> wrote:

>  From a corpus analysis point of view, who owns this data? How do we
> know it is the real thing?
> ~
>  I don't see any validation data provided with the Enron Email Dataset
> (http://www.cs.cmu.edu/~enron/)
> ~
>  lbrtchx
>
> On 9/15/11, Mark Kerzner <[email protected]> wrote:
> > Mike,
> >
> > I certainly will do it. I am refactoring the code before I run those
> > tests again.
> >
> > Sincerely,
> > Mark
> >
> > On Thu, Sep 15, 2011 at 5:26 AM, Michael McCandless <
> > [email protected]> wrote:
> >
> >> That summary is nice, but, can you provide specifics on which docs
> >> caused problems for Tika?
> >>
> >> Ie, if a certain doc hits an exception, we should open a Jira issue
> >> and get it fixed...
> >>
> >> Thanks,
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >> On Thu, Sep 8, 2011 at 8:43 AM, Mark Kerzner <[email protected]>
> >> wrote:
> >> > The processing is complete; the summary can be found here.
> >> > Mark
> >> >
> >> > On Wed, Sep 7, 2011 at 7:29 AM, Michael McCandless
> >> > <[email protected]> wrote:
> >> >>
> >> >> On Tue, Sep 6, 2011 at 9:29 PM, Mark Kerzner <[email protected]>
> >> >> wrote:
> >> >>
> >> >> > Is anybody interested in the results of all the testing that
> >> >> > I am doing, and if yes, how should I report my findings?
> >> >>
> >> >> I'm interested!  This sounds great....
> >> >>
> >> >> Tika should strive to have no errors on any valid documents... so if
> >> >> you (or anyone) are hitting bugs in Tika/POI/PDFBox/etc., let's
> >> >> characterize them, open issues, and get them fixed :)
> >> >>
> >> >> Mike McCandless
> >> >>
> >> >> http://blog.mikemccandless.com
> >> >
> >> >
> >>
> >
>
