Re: Pages per second on EC2?

2011-03-04 Thread Ken Krugler
Hi Otis, More input, though mostly from recent experience w/Bixo... I'm trying to do some basic calculations trying to figure out what, in terms of time, resources, and cost, it would take to crawl 500M URLs. The obvious environment for this is EC2, so I'm wondering what people are seei

Re: Pages per second on EC2?

2011-03-04 Thread Ken Krugler
Hi Otis, I'm trying to do some basic calculations trying to figure out what, in terms of time, resources, and cost, it would take to crawl 500M URLs. I can't directly comment on Nutch, but we recently did something similar to this (563M pages via EC2) using Bixo. Eh, I was going to

Re: Pages per second on EC2?

2011-03-04 Thread Otis Gospodnetic
Hi > I'm trying to do some basic calculations trying to figure out what, in terms of time, resources, and cost, it would take to crawl 500M URLs. > > The obvious environment for this is EC2, so I'm wondering what people are seeing in terms of fetch rate there these days? 50 pa

Re: Pages per second on EC2?

2011-03-04 Thread Otis Gospodnetic
Hi, > Hi Otis, Hi Ken :) > > I'm trying to do some basic calculations trying to figure out what, in terms of time, resources, and cost, it would take to crawl 500M URLs. > > I can't directly comment on Nutch, but we recently did something similar to this (563M pages via EC2) us

How to find out which urlfilter File I am using

2011-03-04 Thread Klemens Muthmann
Hi, I am trying to configure my nutch crawler with the runbot script from the wiki. http://wiki.apache.org/nutch/Crawl I tried to insert regular expressions into regex-urlfilter.txt and into crawl-urlfilter.txt but it seems they are not working. Now I do not know whether my Regex is wrong or

Re: Pages per second on EC2?

2011-03-04 Thread Ken Krugler
Hi Otis, I'm trying to do some basic calculations trying to figure out what, in terms of time, resources, and cost, it would take to crawl 500M URLs. I can't directly comment on Nutch, but we recently did something similar to this (563M pages via EC2) using Bixo. Since we're also using a

Re: Nutch Parser annoyingly faulty

2011-03-04 Thread Juergen Specht
Hi Julien, On 3/4/11 7:09 PM, Julien Nioche wrote: Thanks for reporting the problem Jurgen, and sorry that you felt you were being ignored. The few active developers Nutch has contribute during their spare time, the reason why you did not get any comments on this, is that no one had an instant

Re: Pages per second on EC2?

2011-03-04 Thread Julien Nioche
Hi Otis, > I'm trying to do some basic calculations trying to figure out what, in terms of time, resources, and cost, it would take to crawl 500M URLs. > The obvious environment for this is EC2, so I'm wondering what people are seeing in terms of fetch rate there these days? 50 pages/second?

Re: Nutch Parser annoyingly faulty

2011-03-04 Thread Julien Nioche
Hi Jurgen, > Since I wrote this email - which I thought got ignored by the Nutch developers - Thanks for reporting the problem Jurgen, and sorry that you felt you were being ignored. The few active developers Nutch has contribute during their spare time, the reason why you did not get any com