To: dev@nutch.apache.org
Subject: Re: Nutch Crawl a Specific List Of URLs (150K)

Thanks for all the responses, they are very inspiring, and diving into the log level is very beneficial for learning Nutch.
The fact is that I use Python BeautifulSoup to parse the sitemap of my targeted website, which comes up with those 150K URLs.
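(For context, a minimal sketch of that sitemap-extraction step, assuming a standard <urlset>/<loc> sitemap; the sitemap URL, output path, and use of the requests library are illustrative, not from the original thread:)

    # Fetch a sitemap, pull every <loc> URL, and write a Nutch seed file.
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("http://www.example.com/sitemap.xml")  # hypothetical site
    soup = BeautifulSoup(resp.content, "xml")  # the "xml" parser requires lxml

    urls = [loc.get_text().strip() for loc in soup.find_all("loc")]

    # Nutch's inject step reads plain-text seed lists, one URL per line.
    with open("urls/seed.txt", "w") as f:
        f.write("\n".join(urls) + "\n")

    print("wrote %d URLs" % len(urls))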
Cheers

-----Original message-----
From: Bin Wang <binwang...@gmail.com>
Sent: Friday 27th December 2013 19:50
To: dev@nutch.apache.org
Subject: Nutch Crawl a Specific List Of URLs (150K)

Hi,
I have a very specific list of URLs, which is about 140K URLs.
I switch off [...]

[...] how much it crawled and why it likely stopped when it did.
Cheers

Hi Bin Wang,

> nohup bin/nutch crawl urls -dir result -depth 1 -topN 20

Were you creating a new crawldb or reusing an old one?
Were you running this on a cluster or in local mode?
Was there any failure due to which the fetch round got aborted? (See the logs for this.)
I would like to [...]
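(A note on the quoted command: in the 1.x crawl tool, -topN caps how many URLs the generator puts into each fetch round, so -depth 1 -topN 20 would fetch at most 20 of the ~140K seeds. A sketch of the same invocation sized for the full list; the exact number is illustrative:)

    nohup bin/nutch crawl urls -dir result -depth 1 -topN 150000 &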

Hi Bin,
You have an interesting error. I don't use 1.7, but you can try with the screen command; I believe you will not get the same error.
Talat
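(For anyone following along, the screen suggestion presumably means running the crawl inside a detachable terminal session so it survives a dropped SSH connection; a minimal sketch, with the session name illustrative:)

    screen -S nutch-crawl                # start a named session
    bin/nutch crawl urls -dir result -depth 1 -topN 150000
    # detach with Ctrl-a d; reattach later with: screen -r nutch-crawl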

2013/12/27 Bin Wang <binwang...@gmail.com>:
Hi,
I have a very specific list of URLs, which is about 140K URLs.
I switch off the `db.update.additions.allowed` so it will not update the crawldb... and I was assuming I could feed all the URLs to Nutch, and after one round of fetching, it would finish and leave all the raw HTML files in the [...]
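(db.update.additions.allowed is a standard Nutch property; switching it off is typically done by overriding it in conf/nutch-site.xml, roughly as below; the description text is paraphrased, not copied from Nutch's defaults:)

    <property>
      <name>db.update.additions.allowed</name>
      <value>false</value>
      <description>If false, the updatedb step only updates URLs already
      in the crawldb and does not add newly discovered links.</description>
    </property>

That matches the behavior described above: the injected seeds stay, nothing new is added, so one generate/fetch round covers exactly the seed list.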