Best Practice to optimize Parse reduce step / ParseoutputFormat

kemical Fri, 08 Feb 2013 01:53:38 -0800

Hi,

I've been looking for some time now into why the Parse reduce step takes so
long. I've found lots of different suggestions, but not much feedback on
which ones actually work.



First, here is a list of the threads I've found, along with patch NUTCH-1314:

http://lucene.472066.n3.nabble.com/Parse-reduce-slow-as-a-snail-td3296865.html
http://lucene.472066.n3.nabble.com/ParseSegment-taking-a-long-time-to-finish-td3758053.html
http://lucene.472066.n3.nabble.com/ParseSegment-slow-reduce-phase-td612119.html
https://issues.apache.org/jira/browse/NUTCH-1314

Here are some questions about what I've found in them:

- It seems that parse reduce time is mainly due to long URLs
=> Can anyone confirm that excluding long URLs (with the patch, a regex, or
whatever) actually gave them better performance?

- The normalizing step occurs before filtering:
=> If so, is there any real benefit to filtering URLs with a regex (like the
-^.{350,}$ expression)?

- Patch 1314 seems to have been written for parsing with parse-html
=> I'm using boilerpipe with patch NUTCH-961; should patch 1314 work with
it? (I guess not.) If not, what change should I make? (I'm rather wary of
writing a patch/plugin myself.)
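
For reference, here is a minimal Python sketch of what the -^.{350,}$
expression matches. The leading "-" is the exclude marker used in
regex-urlfilter.txt and is not part of the pattern itself. The function and
variable names are hypothetical, and this only illustrates the regex, not
Nutch's actual filter or normalizer chain:

```python
import re

# Pattern from the rule quoted above, without the leading "-"
# (the "-" only tells the URL filter to exclude matching URLs).
LONG_URL = re.compile(r"^.{350,}$")


def is_excluded(url: str) -> bool:
    """Return True when the URL is 350 characters or longer."""
    return LONG_URL.search(url) is not None


short_url = "http://example.com/page"
long_url = "http://example.com/?q=" + "a" * 400

print(is_excluded(short_url))  # False
print(is_excluded(long_url))   # True
```

So the rule is purely a length cutoff: any URL of 350 characters or more
would be dropped, whatever it contains.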

This is not an exhaustive list of questions, so if you have questions and/or
recommendations, please add them.



Sorry for starting a new thread, since this could have been posted as a
reply to my last one:
http://lucene.472066.n3.nabble.com/Very-long-time-just-before-fetching-and-just-after-parsing-td4037673.html
but I think the title of this one could be useful to more people (mine was
too specific).





