Your question was valid: why is my fetch so slow, and how can I accelerate it? Again, first check your robots.txt. With so few domains it's almost certain that politeness is the problem here.
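To make the robots.txt suggestion concrete: Nutch honors a `Crawl-delay` directive in preference to `fetcher.server.delay`, so a single large delay value can throttle an entire host no matter what the config says (and, if I recall the property correctly, pages from hosts whose delay exceeds `fetcher.max.crawl.delay` may be skipped entirely). A minimal sketch of checking this with Python's standard `urllib.robotparser` — the robots.txt content below is a made-up example, not taken from any of the actual sites in this thread:

```python
# Check what Crawl-delay a host advertises in robots.txt.
# A large value here caps per-host throughput regardless of
# fetcher.server.delay.
from urllib.robotparser import RobotFileParser

# Example robots.txt content (hypothetical; fetch the real one
# with rp.set_url(...) / rp.read() for each of your 10 domains).
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 300
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

delay = rp.crawl_delay("*")   # seconds between requests, or None if unset
print(delay)                  # -> 300

# With a 300 s delay, one host yields at most 86400 / 300 = 288 pages/day.
print(86400 // delay)         # -> 288
```

Run this once per domain; if any of them advertise a delay in the tens or hundreds of seconds, that alone explains a stagnating fetch.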
> Hi List,
> Hi Hannes,
>
> All logs are without errors and warnings. Injecting, updating, merging and
> indexing are not a problem and take only minutes. One cycle takes 2 days
> with my parameters. Regex-urlfilter.txt is checked against the URL format
> of all sites.
>
> But I'm sorry to the list, I may not have asked clearly. I'm interested
> mainly in why there is such a big difference between fetched and unfetched
> URLs, and what I can do to force fetching.
>
> Please see my current readdb -stats output:
> TOTAL urls: 1698520
> [...]
> status 1 (db_unfetched): 1567047
> status 2 (db_fetched): 90399
> status 3 (db_gone): 11696
> status 4 (db_redir_temp): 4065
> status 5 (db_redir_perm): 10137
> status 6 (db_notmodified): 15176
>
> The process has now been running for exactly 30 days. In the meantime I have
> 90,399 fetched instead of the 30,000 after 15 days. Is this normal?
>
> Regards
> Thomas
>
> From: Hannes Carl Meyer [mailto:[email protected]]
> Sent: Tuesday, 30 August 2011 09:25
> To: [email protected]
> Cc: Eggebrecht, Thomas (GfK Marktforschung)
> Subject: Re: Parameter tuning or how to accelerate fetching
>
> Hi Thomas,
>
> first, 30,000 pages in two weeks is rather few...
>
> Where did you get the total number of pages from? From the CrawlDB?
> Please post a bin/nutch readdb crawldb/ -stats output here.
>
> How long does one cycle take?
>
> If your regex-urlfilter.txt is still the standard setting, check your
> websites for common query URLs like
> "index.php?param=value&param1=..". The standard regex-urlfilter is
> sometimes very strict in this case.
>
> BR
>
> Hannes
>
> --
>
> https://www.xing.com/profile/HannesCarl_Meyer
> http://de.linkedin.com/in/hannescarlmeyer
>
> On Mon, Aug 29, 2011 at 5:33 PM, Eggebrecht, Thomas (GfK Marktforschung)
> <[email protected]<mailto:[email protected]>> wrote:
> Dear List,
>
> My process fetches only 10, but very big, domains with millions of pages on
> each site.
> I now wonder why, after 2 weeks and 17 crawl-fetch cycles, I got only
> about 30,000 pages, and fetching seems to be stagnating.
>
> How would you accelerate fetching?
>
> My current parameters (using Nutch-1.2):
> topN: 40,000
> depth: 8
> adddays: 30
> fetcher.server.delay: 1
> db.max.outlinks.per.page: 500
>
> All parameters not mentioned have standard values, as does
> regex-urlfilter.txt.
>
> Best Regards
> Thomas
>
> ________________________________
>
> GfK SE, Nuremberg, Germany, commercial register Nuremberg HRB 25014;
> Management Board: Professor Dr. Klaus L. Wübbenhorst (CEO), Pamela Knapp
> (CFO), Dr. Gerhard Hausruckinger, Petra Heinlein, Debra A. Pruent, Wilhelm
> R. Wessels; Chairman of the Supervisory Board: Dr. Arno Mahlert. This email
> and any attachments may contain confidential or privileged information.
> Please note that unauthorized copying, disclosure or distribution of the
> material in this email is not permitted.
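For what it's worth, the numbers quoted in the thread allow a quick back-of-envelope estimate of the effective per-host delay, assuming the 10 hosts are fetched in parallel with one fetch thread per host queue (Nutch's default politeness model):

```python
# Back-of-envelope: what per-host delay would produce the observed rate?
# Numbers taken from the readdb -stats output quoted above.
fetched_pages = 90_399          # db_fetched after 30 days
days = 30
hosts = 10                      # the crawl covers 10 large domains

domain_seconds = hosts * days * 86_400
effective_delay = domain_seconds / fetched_pages  # seconds per fetch per host
print(round(effective_delay))   # -> 287
```

With fetcher.server.delay set to 1 s, an effective delay of roughly 287 s per host means something else — most plausibly a robots.txt Crawl-delay, or a per-cycle generate limit — is throttling the crawl by two orders of magnitude, which matches the advice at the top of the thread.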

