or even without any restrictions from robots.txt: by default Nutch waits 5
seconds between fetches from the same host. If you have 100K URLs from a single
host, it will take at least 138 hours just to fetch them.
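
If crawling those hosts faster is actually acceptable to them, the relevant
knobs in Nutch 1.2 are fetcher.server.delay and fetcher.threads.per.host in
nutch-site.xml. A minimal sketch, with purely illustrative values:

  <!-- nutch-site.xml: illustrative values only, not a recommendation -->
  <property>
    <name>fetcher.server.delay</name>
    <!-- seconds between successive requests to the same host (default 5.0) -->
    <value>1.0</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <!-- parallel fetches allowed against one host; with a value > 1 the
         fetcher consults fetcher.server.min.delay instead of fetcher.server.delay -->
    <value>2</value>
  </property>

Raising fetcher.threads.per.host is what actually multiplies throughput when
everything lives on a handful of hosts, but it also multiplies the load you put
on them.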

On 30 August 2011 11:23, Markus Jelsma <[email protected]> wrote:

> Your question was valid: why is my fetch so slow and how can I accelerate it?
>
> Again, first check your robots.txt. With so few domains it's almost certain
> that politeness is the problem here.
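>
> For example, a quick manual check (www.example.com stands in for one of your
> hosts):
>
>   # look for a Crawl-delay directive, which Nutch honours per host
>   curl -s http://www.example.com/robots.txt | grep -iE 'crawl-delay|disallow'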
>
> > Hi List,
> > Hi Hannes,
> >
> > All logs are free of errors and warnings. Injecting, updating, merging and
> > indexing are not a problem and take only minutes. One cycle takes 2 days
> > with my parameters. regex-urlfilter.txt has been checked against the URL
> > format of all sites.
> >
> > But my apologies to the list, I may not have asked clearly. I'm mainly
> > interested in why there is such a big difference between fetched and
> > unfetched URLs, and what I can do to force fetching.
> >
> > Please see my current readdb -stats output:
> > TOTAL urls: 1698520
> > [...]
> > status 1 (db_unfetched): 1567047
> > status 2 (db_fetched): 90399
> > status 3 (db_gone): 11696
> > status 4 (db_redir_temp): 4065
> > status 5 (db_redir_perm): 10137
> > status 6 (db_notmodified): 15176
> >
> > The process has now been running for exactly 30 days. In the meantime I have
> > 90,399 fetched URLs, up from 30,000 after 15 days. Is this normal?
> >
> > Regards
> > Thomas
> >
> > From: Hannes Carl Meyer [mailto:[email protected]]
> > Sent: Tuesday, 30 August 2011 09:25
> > To: [email protected]
> > Cc: Eggebrecht, Thomas (GfK Marktforschung)
> > Subject: Re: Parameter tuning or how to accelerate fetching
> >
> > Hi Thomas,
> >
> > first, 30,000 pages in two weeks is rather few...
> >
> > Where did you get the total number of pages from? From the CrawlDb?
> > Please post a bin/nutch readdb crawldb/ -stats output here.
> >
> > How long does one cycle take?
> >
> > If your regex-urlfilter.txt still has the standard settings, check your
> > websites for common query URLs like "index.php?param=value&param1..". The
> > standard regex-urlfilter is sometimes very strict in this case.
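> >
> > For reference, the rule in the stock conf/regex-urlfilter.txt that drops
> > query-style URLs looks like the lines below; if your sites rely on
> > ?param=value URLs you would comment it out or narrow it (sketch only, adapt
> > to your own URL patterns):
> >
> >   # skip URLs containing certain characters as probable queries, etc.
> >   -[?*!@=]
> >   # relaxed alternative: keep query strings, still drop wildcards/anchors
> >   # -[*!@]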
> >
> > BR
> >
> > Hannes
> >
> > --
> >
> > https://www.xing.com/profile/HannesCarl_Meyer
> > http://de.linkedin.com/in/hannescarlmeyer
> > On Mon, Aug 29, 2011 at 5:33 PM, Eggebrecht, Thomas (GfK Marktforschung)
> > <[email protected]> wrote:
> >
> > Dear List,
> >
> > My process fetches only 10 domains, but they are very big, with millions of
> > pages on each site. I now wonder why, after 2 weeks and 17 crawl-fetch
> > cycles, I have only about 30,000 pages, and the number seems to be
> > stagnating.
> >
> > How would you accelerate fetching?
> >
> > My current parameters (using Nutch-1.2):
> > topN: 40,000
> > depth: 8
> > adddays: 30
> > fetcher.server.delay: 1
> > db.max.outlinks.per.page: 500
> >
> > All parameters not mentioned have their standard values, as does
> > regex-urlfilter.txt.
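> >
> > For completeness, one cycle with these values boils down to roughly the
> > following generate/fetch/parse/update sequence (a sketch; the crawl/ layout
> > and the thread count are placeholders, and depth 8 simply means repeating
> > the block 8 times per run):
> >
> >   bin/nutch generate crawl/crawldb crawl/segments -topN 40000 -adddays 30
> >   SEGMENT=$(ls -d crawl/segments/* | tail -1)   # newest segment
> >   bin/nutch fetch $SEGMENT -threads 10
> >   bin/nutch parse $SEGMENT
> >   bin/nutch updatedb crawl/crawldb $SEGMENT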
> >
> > Best Regards
> > Thomas
> >
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
