RE: Nutch Crawl a Specific List Of URLs (150K)

2014-01-03 Thread Markus Jelsma
To: dev@nutch.apache.org Subject: Re: Nutch Crawl a Specific List Of URLs (150K) Thanks for all the responses, they are very inspiring, and diving into the log level is very beneficial for learning Nutch. The fact is that I use Python BeautifulSoup to parse the sitemap of my targeted website, which comes up with those 150K…
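
(Aside for readers following the archive: the seed list above was built with a Python BeautifulSoup script, per the poster. Purely as an illustrative shell equivalent, not the poster's method, a sitemap's <loc> entries can be pulled into a Nutch seed file like this; the sitemap URL and paths are hypothetical, and the pipeline assumes GNU grep with PCRE support.)

```
# Illustrative only: the thread's actual list was produced with a Python BeautifulSoup script.
# Hypothetical sitemap URL and output path; grep -oP needs GNU grep built with PCRE.
mkdir -p urls
curl -s http://www.example.com/sitemap.xml \
  | grep -oP '(?<=<loc>)[^<]+' \
  > urls/seeds.txt
wc -l urls/seeds.txt   # sanity-check that the count matches the expected ~150K
```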

Re: Nutch Crawl a Specific List Of URLs (150K)

2014-01-02 Thread Bin Wang
…stopped when it did. Cheers -Original message- From: Bin Wang <binwang...@gmail.com> Sent: Friday 27th December 2013 19:50 To: dev@nutch.apache.org Subject: Nutch Crawl a Specific List Of URLs (150K) Hi, I have a very specific list of URLs, which is about 140K URLs. I switched off…

RE: Nutch Crawl a Specific List Of URLs (150K)

2013-12-30 Thread Markus Jelsma
…much it crawled and why it likely stopped when it did. Cheers -Original message- From: Bin Wang <binwang...@gmail.com> Sent: Friday 27th December 2013 19:50 To: dev@nutch.apache.org Subject: Nutch Crawl a Specific List Of URLs (150K) Hi, I have a very specific list of URLs, which is about…

Re: Nutch Crawl a Specific List Of URLs (150K)

2013-12-29 Thread Tejas Patil
Hi Bin Wang, `nohup bin/nutch crawl urls -dir result -depth 1 -topN 20` Were you creating a new crawldb or reusing an old one? Were you running this on a cluster or in local mode? Was there any failure due to which the fetch round got aborted? (See the logs for this.) I would like to…
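
(A minimal sketch of where to look for the answers Tejas asks about, assuming a local-mode run with the output directory `result` from the command above; paths follow the defaults of a Nutch 1.x binary distribution.)

```
# How far did the crawl get? readdb -stats summarizes fetched/unfetched/gone counts.
bin/nutch readdb result/crawldb -stats

# One segment is created per generate/fetch round.
ls result/segments/

# In local mode Nutch logs to logs/hadoop.log; failed or aborted rounds show up here.
grep -iE "error|exception" logs/hadoop.log | tail -n 50
```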

Re: Nutch Crawl a Specific List Of URLs (150K)

2013-12-28 Thread Talat Uyarer
Hi Bin, you have an interesting error. I don't use 1.7, but try it with the screen command; I believe you will not get the same error. Talat 2013/12/27 Bin Wang <binwang...@gmail.com> Hi, I have a very specific list of URLs, which is about 140K URLs. I switched off `db.update.additions.allowed`…
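
(A minimal sketch of that suggestion, reusing the command quoted in the thread: run the crawl inside a screen session instead of nohup so it survives a dropped SSH connection; the session name is arbitrary.)

```
# Start a named screen session and run the crawl inside it.
screen -S nutch-crawl
bin/nutch crawl urls -dir result -depth 1 -topN 20

# Detach with Ctrl-a d; later re-attach from any shell to check progress.
screen -r nutch-crawl
```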

Nutch Crawl a Specific List Of URLs (150K)

2013-12-27 Thread Bin Wang
Hi, I have a very specific list of URLs, about 140K in total. I switched off `db.update.additions.allowed` so it will not update the crawldb... and I was assuming I could feed all the URLs to Nutch, and after one round of fetching, it would finish and leave all the raw HTML files in the…
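
(For context, a hedged sketch of what such a single-round crawl can look like with the step-by-step Nutch 1.x commands instead of the all-in-one `crawl` command; the paths, the -topN value, and the segment-name handling are illustrative, not taken from the thread.)

```
# conf/nutch-site.xml would carry the property the poster mentions:
#   <property>
#     <name>db.update.additions.allowed</name>
#     <value>false</value>
#   </property>

bin/nutch inject crawl/crawldb urls                        # seed the crawldb with the URL list
bin/nutch generate crawl/crawldb crawl/segments -topN 150000
SEGMENT=crawl/segments/$(ls crawl/segments | tail -n 1)    # newest segment from the generate step
bin/nutch fetch "$SEGMENT"                                 # the single fetch round
bin/nutch parse "$SEGMENT"
bin/nutch readseg -dump "$SEGMENT" dump_dir                # dump the fetched content (incl. raw HTML) as text
```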