Done, but now I get additional errors:

-------------------
llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110716105826
CrawlDb update: starting at 2011-07-16 11:03:56
CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
CrawlDb update: segments: [file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch, file:/home/llist/nutchData/crawl/segments/20110716105826/content, file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse, file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data, file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate, file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
 - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch
 - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/content
 - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse
 - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data
 - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate
 - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-07-16 11:03:57, elapsed: 00:00:01
-------------------------------------------
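A likely cause, if I read the log correctly: `updatedb -dir` expects the *parent* segments directory and treats every subdirectory it finds there as a segment. Pointing `-dir` at a single segment makes Nutch inspect crawl_fetch, content, etc. as if each were a segment, which is exactly why they are skipped as invalid. A sketch of the two invocations that should work instead (paths taken from the log above):

```shell
# Option 1: pass the segment itself as a positional argument (no -dir).
/usr/share/nutch/runtime/local/bin/nutch updatedb \
    /home/llist/nutchData/crawl/crawldb \
    /home/llist/nutchData/crawl/segments/20110716105826

# Option 2: point -dir at the parent segments directory, so every
# segment underneath it is merged into the crawldb.
/usr/share/nutch/runtime/local/bin/nutch updatedb \
    /home/llist/nutchData/crawl/crawldb \
    -dir /home/llist/nutchData/crawl/segments
```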
On Sat, 2011-07-16 at 02:36 +0200, Markus Jelsma wrote:
> fetch, then parse.
>
> > I'm running nutch 1.3 on 64 bit Ubuntu, following are the commands and
> > relevant output.
> >
> > ----------------------------------
> > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed
> > Injector: starting at 2011-07-15 18:32:10
> > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > Injector: urlDir: /home/llist/nutchData/seed
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: finished at 2011-07-15 18:32:13, elapsed: 00:00:02
> > =================
> > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments
> > Generator: starting at 2011-07-15 18:32:41
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls for politeness.
> > Generator: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > Generator: finished at 2011-07-15 18:32:45, elapsed: 00:00:03
> > ==================
> > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch fetch /home/llist/nutchData/crawl/segments/20110715183244
> > Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> > Fetcher: starting at 2011-07-15 18:34:55
> > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > Fetcher: threads: 10
> > QueueFeeder finished: total 1 records + hit by time limit :0
> > fetching http://www.seek.com.au/
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=2
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > -finishing thread FetcherThread, activeThreads=0
> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=0
> > Fetcher: finished at 2011-07-15 18:34:59, elapsed: 00:00:03
> > =================
> > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110715183244
> > CrawlDb update: starting at 2011-07-15 18:36:00
> > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > CrawlDb update: segments: [file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch, file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate, file:/home/llist/nutchData/crawl/segments/20110715183244/content]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: false
> > CrawlDb update: URL filtering: false
> > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch
> > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate
> > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/content
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: finished at 2011-07-15 18:36:01, elapsed: 00:00:01
> > -----------------------------------
> >
> > Appreciate any hints on what I'm missing.

