Hi Otis,
More input, though mostly from recent experience w/Bixo...
I'm trying to do some basic calculations trying to figure out what, in
terms of time, resources, and cost, it would take to crawl 500M URLs.
The obvious environment for this is EC2, so I'm wondering what people
are seei
Hi Otis,
I'm trying to do some basic calculations trying to figure out what, in
terms of time, resources, and cost, it would take to crawl 500M URLs.
I can't directly comment on Nutch, but we recently did something
similar to this (563M pages via EC2) using Bixo.
Eh, I was going to
Hi
> > I'm trying to do some basic calculations trying to figure out what, in
> > terms of time, resources, and cost, it would take to crawl 500M URLs.
> > The obvious environment for this is EC2, so I'm wondering what people
> > are seeing in terms of fetch rate there these days? 50 pa
Hi,
> Hi Otis,
Hi Ken :)
> > I'm trying to do some basic calculations trying to figure out what, in
> > terms of time, resources, and cost, it would take to crawl 500M URLs.
>
> I can't directly comment on Nutch, but we recently did something similar to
> this (563M pages via EC2) us
Hi,
I am trying to configure my Nutch crawler with the runbot script from
the wiki. http://wiki.apache.org/nutch/Crawl
I tried to insert regular expressions into regex-urlfilter.txt and into
crawl-urlfilter.txt but it seems they are not working. Now I do not know
whether my Regex is wrong or
Hi Otis,
I'm trying to do some basic calculations trying to figure out what,
in terms of time, resources, and cost, it would take to crawl 500M URLs.
I can't directly comment on Nutch, but we recently did something
similar to this (563M pages via EC2) using Bixo.
Since we're also using a
Hi Julien,
On 3/4/11 7:09 PM, Julien Nioche wrote:
Thanks for reporting the problem, Jurgen, and sorry that you felt you
were being ignored. The few active developers Nutch has contribute
during their spare time; the reason why you did not get any comments
on this is that no one had an instant
Hi Otis,
> I'm trying to do some basic calculations trying to figure out what, in
> terms of time, resources, and cost, it would take to crawl 500M URLs.
> The obvious environment for this is EC2, so I'm wondering what people are
> seeing in terms of fetch rate there these days? 50 pages/second?
Hi Jurgen,
> Since I wrote this email - which I thought got ignored by the
> Nutch developers -
Thanks for reporting the problem, Jurgen, and sorry that you felt you were
being ignored. The few active developers Nutch has contribute during their
spare time; the reason why you did not get any com
9 matches