Re: A bug in org.apache.nutch.parse.ParseUtil?

Mattmann, Chris A (3980) Fri, 17 Apr 2015 08:26:42 -0700

Awesome Arkadi. This sounds legit.

Can you scope this?


https://github.com/apache/nutch/#contributing


File an issue and then push a PR, and I’ll be sure to merge
it.

Cheers!

Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: "[email protected]" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, April 17, 2015 at 12:31 AM
To: "[email protected]" <[email protected]>
Subject: A bug in org.apache.nutch.parse.ParseUtil?

>Hi,
>
>From reading the code it is clear that it is designed to allow using
>several parsers to parse a document in a sequence, until it is
>successfully parsed. In practice, this does not work because these lines
>
>if (parseResult != null && !parseResult.isEmpty())
>        return parseResult;
>
>break the loop even if the parsing has failed, because parseResult is not
>empty anyway: it contains a ParseData with ParseStatus.FAILED.
>This is easy to fix, for example, by adding a line before the two lines
>mentioned above:
>
>if ( parseResult != null ) parseResult.filter() ;
>
>This will remove failed ParseData objects from the parseResult and leave
>it empty if the parsing was unsuccessful. I believe that this fix is
>important because it allows use of backup parsers as originally designed
>and thus increases index completeness.
>
>Regards,
>Arkadi
>
>
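
For readers following the thread, here is a minimal sketch of the behaviour
described in the quoted message. It does not use the real Nutch classes:
Outcome, Result, Parser, parser() and parseWithFallback() are hypothetical
stand-ins invented for illustration, and Result.filter() only imitates what
the message says ParseResult.filter() does (dropping failed entries). The
filterFailures flag toggles the proposed extra line so both behaviours can be
compared.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParserFallbackSketch {

    // Hypothetical stand-in for one parse outcome (not Nutch's ParseData).
    static final class Outcome {
        final String parserName;
        final boolean success;

        Outcome(String parserName, boolean success) {
            this.parserName = parserName;
            this.success = success;
        }
    }

    // Hypothetical stand-in for a parse result container (not Nutch's ParseResult).
    static final class Result {
        private final List<Outcome> outcomes = new ArrayList<>();

        void add(Outcome o) { outcomes.add(o); }

        boolean isEmpty() { return outcomes.isEmpty(); }

        // Drops failed outcomes, imitating the filter() call proposed in the message.
        void filter() { outcomes.removeIf(o -> !o.success); }

        String firstParser() { return outcomes.get(0).parserName; }
    }

    // Hypothetical parser interface: every parser returns a non-empty Result,
    // even on failure, which is the situation the message describes.
    interface Parser {
        Result parse(String content);
    }

    static Parser parser(String name, boolean succeeds) {
        return content -> {
            Result r = new Result();
            r.add(new Outcome(name, succeeds));
            return r;
        };
    }

    // Tries each parser in order and returns the first non-empty result.
    // With filterFailures == false, a failed first parser still ends the loop.
    static Result parseWithFallback(List<Parser> parsers, String content,
                                    boolean filterFailures) {
        for (Parser p : parsers) {
            Result parseResult = p.parse(content);
            if (filterFailures && parseResult != null) {
                parseResult.filter(); // the proposed extra line
            }
            if (parseResult != null && !parseResult.isEmpty()) {
                return parseResult;
            }
        }
        return new Result(); // no parser produced a usable result
    }

    public static void main(String[] args) {
        List<Parser> parsers = Arrays.asList(
                parser("primary", false),  // always fails
                parser("backup", true));   // always succeeds

        // Without filtering, the failed primary result is returned.
        System.out.println(parseWithFallback(parsers, "doc", false).firstParser());
        // With filtering, the loop falls through to the backup parser.
        System.out.println(parseWithFallback(parsers, "doc", true).firstParser());
    }
}
```

In this sketch the first call prints "primary" and the second prints "backup",
which is the difference the proposed filter() line is meant to make.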
