Hi Thomas,

we have much the same problem on some hosts.

Therefore we run a Squid proxy with a bunch of good ol' outgoing IP
addresses to avoid penalties; Squid picks one of those addresses at random
for each request.
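
For illustration, the relevant part of our squid.conf looks roughly like
the snippet below. The addresses are placeholders and the random ACL type
needs a reasonably recent Squid (3.2+), so check your version:

    # Spread requests over three outgoing addresses.
    # The first rule matches ~1/3 of all requests, the second ~1/2 of
    # the remainder, and the last line is the fallback, so each address
    # ends up serving roughly a third of the traffic.
    acl third_of_requests random 1/3
    acl half_of_requests random 1/2
    tcp_outgoing_address 192.0.2.1 third_of_requests
    tcp_outgoing_address 192.0.2.2 half_of_requests
    tcp_outgoing_address 192.0.2.3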

BR

Hannes

-- 

https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer

On Tue, Aug 30, 2011 at 1:20 PM, Eggebrecht, Thomas (GfK Marktforschung) <
[email protected]> wrote:

> Hi Markus,
> my conf:
> fetcher.server.delay: 1
> fetcher.threads.fetch: 10
> fetcher.threads.per.host: 1
>
> That is one URL per second from each host. Maybe that is too polite? These
> are all very big sites with plenty of server capacity, but I don't want my
> IP to get blocked. What settings would you run?
>
> Regards
> Thomas
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Tuesday, August 30, 2011 1:07 PM
> To: [email protected]
> Subject: Re: Parameter tuning or how to accelerate fetching
>
> Hmm, I actually first expected some of the domains to have specified a
> crawl delay.
> Did you customize your fetcher configuration?
>
> - fetcher.server.delay
> - fetcher.threads.fetch
> - fetcher.threads.per.host
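>
> If you did, a sketch of what the overrides look like in conf/nutch-site.xml
> (the values below are only illustrative, not a recommendation):
>
>   <configuration>
>     <property>
>       <name>fetcher.server.delay</name>
>       <value>1.0</value> <!-- seconds between requests to the same host -->
>     </property>
>     <property>
>       <name>fetcher.threads.fetch</name>
>       <value>10</value> <!-- total fetcher threads -->
>     </property>
>     <property>
>       <name>fetcher.threads.per.host</name>
>       <value>2</value> <!-- concurrent requests per host -->
>     </property>
>   </configuration>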
>
>
> Take care: this is going to change slightly in NUTCH-1073.
>
> On Tuesday 30 August 2011 12:55:56 Eggebrecht, Thomas (GfK Marktforschung)
> wrote:
> > I have now checked all the robots.txt files. They all allow me to crawl
> > the areas I am interested in. Furthermore, if I run a search with
> > NutchBean, I see all my desired domains fairly evenly represented:
> > total hits: 91870
> > 13 domain(s):
> > {www.carmondo.de=3903, www.autoextrem.de=1983, www.kbb.com=3706,
> > www.motor-talk.de=7366, www.auto-motor-und-sport.de=12842,
> > www.hunny.de=13107, www.pkw-forum.de=3605, forum.autobild.de=1640,
> > www.carmagazine.co.uk=3740, www.bmw-syndikat.de=33305,
> > community.evo.co.uk=3577, www.pistonheads.com=1450,
> > www.edmunds.com=1646}
> >
> > I think robots.txt can't be the reason.
> >
> > Regards
> > Thomas
> >
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: Tuesday, August 30, 2011 12:24 PM
> > To: [email protected]
> > Cc: Eggebrecht, Thomas (GfK Marktforschung)
> > Subject: Re: Re: Parameter tuning or how to accelerate fetching
> >
> > Your question was valid: why is my fetching so slow, and how can it be
> > accelerated?
> >
> > Again, first check your robots.txt. With so few domains it's almost
> > certain that politeness is the problem here.
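> >
> > A back-of-the-envelope check, assuming your delay of 1 second and one
> > thread per host:
> >
> >   1 URL/s * 86,400 s/day = at most 86,400 URLs/day from any single host
> >   40,000 URLs queued for one host * 1 s = ~11 hours of serial fetching
> >
> > So if one big domain dominates a generated segment, the other fetcher
> > threads sit idle while that single queue drains.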
> >
> > > Hi List,
> > > Hi Hannes,
> > >
> > > All logs are free of errors and warnings. Injecting, updating,
> > > merging and indexing are no problem and take only minutes. One cycle
> > > takes 2 days with my parameters. I have checked regex-urlfilter.txt
> > > against the URL format of all the sites.
> > >
> > > But my apologies to the list, I may not have asked clearly. What I
> > > mainly want to know is why there is such a big difference between
> > > fetched and unfetched URLs, and what I can do to speed up fetching.
> > >
> > > Please see my current readdb -stats output:
> > > TOTAL urls: 1698520
> > > [...]
> > > status 1 (db_unfetched): 1567047
> > > status 2 (db_fetched): 90399
> > > status 3 (db_gone): 11696
> > > status 4 (db_redir_temp): 4065
> > > status 5 (db_redir_perm): 10137
> > > status 6 (db_notmodified): 15176
> > >
> > > The process has now been running for exactly 30 days. I now have
> > > 90,399 fetched pages, compared to 30,000 after 15 days. Is this
> > > normal?
> > >
> > > Regards
> > > Thomas
> > >
> > > From: Hannes Carl Meyer [mailto:[email protected]]
> > > Sent: Tuesday, August 30, 2011 9:25 AM
> > > To: [email protected]
> > > Cc: Eggebrecht, Thomas (GfK Marktforschung)
> > > Subject: Re: Parameter tuning or how to accelerate fetching
> > >
> > > Hi Thomas,
> > >
> > > first, 30,000 pages in two weeks is rather few...
> > >
> > > Where did you get the total number of pages from? From the CrawlDB?
> > > Please post the output of bin/nutch readdb crawldb/ -stats here.
> > >
> > > How long does one cycle take?
> > >
> > > If your regex-urlfilter.txt still has the standard settings, check
> > > your websites for common query URLs like
> > > "index.php?param=value&param1..". The standard regex-urlfilter is
> > > sometimes very strict in such cases.
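> > >
> > > For example, the stock regex-urlfilter.txt ships with a rule along
> > > these lines (check your copy, the exact pattern can differ between
> > > versions):
> > >
> > >   # skip URLs containing certain characters as probable queries, etc.
> > >   -[?*!@=]
> > >
> > > That single line drops every URL containing a '?' or '=', which on
> > > forum software often means all thread and pagination links.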
> > >
> > > BR
> > >
> > > Hannes
> > >
> > > --
> > >
> > > https://www.xing.com/profile/HannesCarl_Meyer
> > > http://de.linkedin.com/in/hannescarlmeyer
> > > On Mon, Aug 29, 2011 at 5:33 PM, Eggebrecht, Thomas (GfK
> > > Marktforschung)
> > > <[email protected]<mailto:[email protected]>> wrote:
> > > Dear List,
> > >
> > > My process fetches only 10 domains, but very big ones with millions
> > > of pages on each site. I now wonder why, after 2 weeks and 17
> > > crawl-fetch cycles, I have only about 30,000 pages, and the count
> > > seems to be stagnating.
> > >
> > > How would you accelerate fetching?
> > >
> > > My current parameters (using Nutch-1.2):
> > > topN: 40,000
> > > depth: 8
> > > adddays: 30
> > > fetcher.server.delay: 1
> > > db.max.outlinks.per.page: 500
> > >
> > > All parameters not mentioned here have their standard values, as does
> > > regex-urlfilter.txt.
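> > >
> > > For clarity, one cycle in my script runs roughly like this (directory
> > > names are just my local layout, and depth 8 means eight such rounds):
> > >
> > >   bin/nutch generate crawl/crawldb crawl/segments -topN 40000 -adddays 30
> > >   bin/nutch fetch crawl/segments/<segment>
> > >   bin/nutch updatedb crawl/crawldb crawl/segments/<segment>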
> > >
> > > Best Regards
> > > Thomas
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
