On Thu, 7 Oct 2010 09:48:57 -0700 (PDT), webdev1977
<webdev1...@gmail.com> wrote:
So how is it that one is able to crawl huge websites with the crawl script
and not use fetcher.parse=false? You would have to have an enormous amount
of disk space to run the parse later.
You can run smaller batches by limiting the number of URLs per host or IP
when generating your fetch list.
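For example, something like this in nutch-site.xml caps the generated fetch
list per host (a sketch for Nutch 1.x; the values are illustrative, and older
releases used generate.max.per.host instead):

  <property>
    <name>generate.max.count</name>
    <value>100</value> <!-- at most 100 URLs per host in each fetch list -->
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>byHost</value> <!-- or byDomain / byIp -->
  </property>

Combined with -topN on the generate command, each segment stays small and
bounded:

  bin/nutch generate crawl/crawldb crawl/segments -topN 10000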
I am not even able to run with fetcher.parse=false and
fetcher.store.content=true. I get an out-of-memory error.
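For reference, the relevant entries in my nutch-site.xml look roughly like
this (a sketch):

  <property>
    <name>fetcher.parse</name>
    <value>false</value> <!-- don't parse while fetching -->
  </property>
  <property>
    <name>fetcher.store.content</name>
    <value>true</value> <!-- keep raw content so the parse can run later -->
  </property>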
I can get about 14,500 URLs with fetcher.parse=true and
fetcher.store.content=false, but that is not even close to the correct
number of documents the crawl should be getting: it should be fetching
around 350,000 documents. I increased the value of db.max.outlinks.per.page.
Now I am getting more URLs, but the crawl is dying after around 12,500 or so
with hung-thread errors. It seems like there are a lot of files getting
written by PDFBox to the /tmp directory, one of which was almost 2 GB!!
Not sure if it helps, but try using a tmp dir on a larger disk; use the
hadoop.tmp.dir property to do so. And again, I think smaller batches will be
more helpful, also for debugging.
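Something like this in nutch-site.xml should do it (the path is just an
example; point it at a directory with plenty of free space):

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/tmp</value> <!-- temp dir on a bigger disk -->
  </property>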
Am I kidding myself that running the Nutch crawl script is the way to build
an index of about 350,000 documents, all of which are file: URLs, all within
a local config, not using HDFS? Maybe I am crazy? :-) (that's entirely
possible!)
I have a test setup that fetched, parsed, updated and indexed about three
million URLs (mostly ordinary web pages in my case). The generated segments
take about 120 GB of disk space, and the Solr index is about 30 GB. 350,000
documents isn't that much, but if the files are large then you are on the
right track (I think) in wanting to parse but not store. Using smaller
batches will help again.
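To make the batches concrete: the steps the crawl script runs can be driven
by hand, one small segment at a time. A rough sketch for a local Nutch 1.x
setup (the paths, the Solr URL and the -topN value are illustrative):

  # generate a small fetch list and pick up the new segment
  bin/nutch generate crawl/crawldb crawl/segments -topN 10000
  segment=$(ls -d crawl/segments/* | tail -1)

  # fetch it (parses inline when fetcher.parse=true)
  bin/nutch fetch "$segment"
  # with fetcher.parse=false, parse as a separate step instead:
  # bin/nutch parse "$segment"

  # fold the results back into the crawldb, then index the segment
  bin/nutch updatedb crawl/crawldb "$segment"
  bin/nutch invertlinks crawl/linkdb "$segment"
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb "$segment"

Repeat the loop until the generator produces no new URLs; if one batch dies
you only lose that segment, not the whole crawl.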
I don't want to invest the time in setting up a Hadoop cluster: 1. I don't
have the hardware resources at the moment. 2. This is a prototype, and if I
can't prove that a crawl will finish (even if it takes a week), then the
prototype will not be developed into anything more useful.
That'd be a pity. Please try the smaller batches and report back; I hope it
helps.