From a corpus analysis point of view, who owns this data? How do we
know it is the real thing?
~
 I don't see any validation data provided with the Enron Email Dataset
(http://www.cs.cmu.edu/~enron/).
~
 lbrtchx

On 9/15/11, Mark Kerzner <[email protected]> wrote:
> Mike,
>
> I certainly will do it. I am refactoring the code before I run those tests
> again.
>
> Sincerely,
> Mark
>
> On Thu, Sep 15, 2011 at 5:26 AM, Michael McCandless <
> [email protected]> wrote:
>
>> That summary is nice, but, can you provide specifics on which docs
>> caused problems for Tika?
>>
>> Ie, if a certain doc hits an exception, we should open a Jira issue
>> and get it fixed...
>>
>> Thanks,
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Sep 8, 2011 at 8:43 AM, Mark Kerzner <[email protected]>
>> wrote:
>> > The processing is complete; the summary can be found here.
>> > Mark
>> >
>> > On Wed, Sep 7, 2011 at 7:29 AM, Michael McCandless
>> > <[email protected]> wrote:
>> >>
>> >> On Tue, Sep 6, 2011 at 9:29 PM, Mark Kerzner <[email protected]>
>> >> wrote:
>> >>
>> >> > Is anybody interested in the results of all the testing that
>> >> > I am doing, and if yes, how should I report my findings?
>> >>
>> >> I'm interested!  This sounds great....
>> >>
>> >> Tika should strive to have no errors on any valid documents... so if
>> >> you (or anyone) are hitting bugs in Tika/POI/PDFBox/etc., let's
>> >> characterize them, open issues, and get them fixed :)
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >
>> >
>>
>
