Awesome Arkadi. This sounds legit. Can you scope this?
https://github.com/apache/nutch/#contributing

File an issue and then push a PR, and I’ll be sure to merge it.

Cheers!
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: "[email protected]" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, April 17, 2015 at 12:31 AM
To: "[email protected]" <[email protected]>
Subject: A bug in org.apache.nutch.parse.ParseUtil?

>Hi,
>
>From reading the code it is clear that it is designed to allow using
>several parsers to parse a document in a sequence, until it is
>successfully parsed. In practice, this does not work because these lines
>
>if (parseResult != null && !parseResult.isEmpty())
>  return parseResult;
>
>break the loop even if the parsing has failed, because parseResult is not
>empty anyway: it contains a ParseData with ParseStatus.FAILED.
>This is easy to fix, for example, by adding a line before the two lines
>mentioned above:
>
>if ( parseResult != null ) parseResult.filter() ;
>
>This will remove failed ParseData objects from the parseResult and leave
>it empty if the parsing had been unsuccessful. I believe that this fix is
>important because it allows use of backup parsers as originally designed
>and thus increases index completeness.
>
>Regards,
>Arkadi
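
For reference when scoping the issue, the behaviour described in the quoted message can be reproduced in a small self-contained sketch. Everything here is a hypothetical stand-in written for illustration: SketchResult, SketchParser and BackupParserSketch are not Nutch classes, and the only things borrowed from the message are the isEmpty() check and the proposed filter() call. It shows a failing first parser hiding a working backup parser unless failed entries are filtered out before the emptiness check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BackupParserSketch {

    /** Hypothetical stand-in for a parse result holding one success flag per entry. */
    static final class SketchResult {
        private final String parserName;
        private final List<Boolean> entries = new ArrayList<>();

        SketchResult(String parserName, boolean success) {
            this.parserName = parserName;
            entries.add(success); // a failed parse still adds an entry
        }

        boolean isEmpty() { return entries.isEmpty(); }

        /** Drops failed entries, mirroring the filter() call proposed in the message. */
        void filter() { entries.removeIf(ok -> !ok); }

        String parserName() { return parserName; }
    }

    /** Hypothetical parser interface used only by this sketch. */
    interface SketchParser {
        SketchResult parse(String content);
    }

    /** Loop as described in the message: returns on the first non-empty result. */
    static SketchResult parseWithoutFilter(List<SketchParser> parsers, String content) {
        for (SketchParser parser : parsers) {
            SketchResult result = parser.parse(content);
            if (result != null && !result.isEmpty()) {
                return result; // also taken when the only entry is a failure
            }
        }
        return null;
    }

    /** Same loop with the proposed one-line fix applied before the check. */
    static SketchResult parseWithFilter(List<SketchParser> parsers, String content) {
        for (SketchParser parser : parsers) {
            SketchResult result = parser.parse(content);
            if (result != null) {
                result.filter(); // failed entries removed, so a failed result becomes empty
            }
            if (result != null && !result.isEmpty()) {
                return result;
            }
        }
        return null;
    }

    /** A primary parser that always fails, followed by a backup that succeeds. */
    static List<SketchParser> demoParsers() {
        SketchParser primary = content -> new SketchResult("primary", false);
        SketchParser backup = content -> new SketchResult("backup", true);
        return Arrays.asList(primary, backup);
    }

    public static void main(String[] args) {
        System.out.println(parseWithoutFilter(demoParsers(), "doc").parserName()); // prints "primary"
        System.out.println(parseWithFilter(demoParsers(), "doc").parserName());    // prints "backup"
    }
}
```

In this sketch the unfiltered loop returns the failed result from the first parser, while the filtered loop falls through to the backup. Whether the real ParseUtil loop should return null, an empty result, or the last failure when every parser fails is not covered here and would need to be decided in the issue.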

