Hi, I have been looking for some time into why the Parse reduce step takes so long. I have found many different suggestions, but little feedback on which of them actually work.
First, here is a list of the threads I have found, plus the NUTCH-1314 patch:

http://lucene.472066.n3.nabble.com/Parse-reduce-slow-as-a-snail-td3296865.html
http://lucene.472066.n3.nabble.com/ParseSegment-taking-a-long-time-to-finish-td3758053.html
http://lucene.472066.n3.nabble.com/ParseSegment-slow-reduce-phase-td612119.html
https://issues.apache.org/jira/browse/NUTCH-1314

Here are some questions about what I found in them:

- It seems that the parse reduce time is mainly due to long URLs.
  => Can anyone who has excluded long URLs (with the patch, a regex, or anything else) confirm that performance is now better?

- The normalizing step apparently occurs before filtering.
  => If so, is there any real benefit in filtering URLs with a regex (like the -^.{350,}$ expression)?

- Patch NUTCH-1314 seems to have been written for parsing with parse-html.
  => I am using boilerpipe with patch NUTCH-961. Should patch NUTCH-1314 work with it? (I guess not.) If not, what changes should I make? (I am quite wary of writing a patch or plugin myself.)

This is not an exhaustive list of questions, so if you have questions and/or recommendations, please add them.

Sorry for starting a new thread, since this could have been added as a reply to my last one:
http://lucene.472066.n3.nabble.com/Very-long-time-just-before-fetching-and-just-after-parsing-td4037673.html
but I think the title of this one could be useful to more people (mine was too specific).
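To make the regex question concrete, here is a minimal standalone sketch of what I understand the -^.{350,}$ rule to do: the leading "-" marks it as an exclusion rule, and the expression itself matches any URL of 350 characters or more. This does not use any Nutch classes; the class name UrlLengthFilterSketch and the helpers accept and urlOfLength are hypothetical names made up for the illustration, and it says nothing about whether filtering runs before or after normalizing.

```java
import java.util.regex.Pattern;

// Standalone illustration only; not Nutch code. All names here are hypothetical.
public class UrlLengthFilterSketch {

    // Same expression as the "-^.{350,}$" rule, without the leading '-'
    // (the '-' is assumed to mean "reject URLs matching this pattern").
    private static final Pattern TOO_LONG = Pattern.compile("^.{350,}$");

    // Returns true if the URL would be kept, false if the length rule rejects it.
    static boolean accept(String url) {
        return !TOO_LONG.matcher(url).find();
    }

    // Builds a dummy URL with exactly n characters (n must exceed the prefix length).
    static String urlOfLength(int n) {
        StringBuilder sb = new StringBuilder("http://example.com/?q=");
        while (sb.length() < n) {
            sb.append('a');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(accept("http://example.com/page")); // true
        System.out.println(accept(urlOfLength(349)));          // true
        System.out.println(accept(urlOfLength(350)));          // false
    }
}
```

So a URL of 349 characters passes and one of 350 characters is rejected. If normalizing really does run before this check, the long URL has already paid the normalizing cost by the time it is dropped, which is the reason for my question.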

