Haven't you forgotten to call parse?

On 19 July 2011 23:40, Leo Subscriptions <[email protected]> wrote:
> Hi Lewis,
>
> You are correct about the last post not showing any errors. I just
> wanted to show that I don't get any errors if I use 'crawl' and to prove
> that I do not have any faults in the conf files or the directories.
>
> I still get the errors if I use the individual commands inject,
> generate, fetch....
>
> Cheers,
>
> Leo
>
> On Tue, 2011-07-19 at 22:09 +0100, lewis john mcgibbney wrote:
> > Hi Leo
> >
> > Did you resolve?
> >
> > Your second log data doesn't appear to show any errors; however, the
> > problem you describe is one I have witnessed myself a while ago. Since
> > you posted have you been able to replicate... or resolve?
> >
> > On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions <[email protected]> wrote:
> > > I've used crawl to ensure the config is correct and I don't get any
> > > errors, so I must be doing something wrong with the individual steps,
> > > but can't see what.
> > >
> > > --------------------------------------------------------------------
> > > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch crawl /home/llist/nutchData/seed/urls -dir /home/llist/nutchData/crawl -depth 3 -topN 5
> > > solrUrl is not set, indexing will be skipped...
> > > crawl started in: /home/llist/nutchData/crawl
> > > rootUrlDir = /home/llist/nutchData/seed/urls
> > > threads = 10
> > > depth = 3
> > > solrUrl=null
> > > topN = 5
> > > Injector: starting at 2011-07-17 09:31:19
> > > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > > Injector: urlDir: /home/llist/nutchData/seed/urls
> > > Injector: Converting injected urls to crawl db entries.
> > > Injector: Merging injected urls into crawl db.
> > > Injector: finished at 2011-07-17 09:31:22, elapsed: 00:00:02
> > > Generator: starting at 2011-07-17 09:31:22
> > > Generator: Selecting best-scoring urls due for fetch.
> > > Generator: filtering: true
> > > Generator: normalizing: true
> > > Generator: topN: 5
> > > Generator: jobtracker is 'local', generating exactly one partition.
> > > Generator: Partitioning selected urls for politeness.
> > > Generator: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > Generator: finished at 2011-07-17 09:31:26, elapsed: 00:00:04
> > > Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> > > Fetcher: starting at 2011-07-17 09:31:26
> > > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > Fetcher: threads: 10
> > > QueueFeeder finished: total 1 records + hit by time limit :0
> > > fetching http://www.seek.com.au/
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > -finishing thread FetcherThread, activeThreads=0
> > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > -activeThreads=0
> > > Fetcher: finished at 2011-07-17 09:31:29, elapsed: 00:00:03
> > > ParseSegment: starting at 2011-07-17 09:31:29
> > > ParseSegment: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > ParseSegment: finished at 2011-07-17 09:31:32, elapsed: 00:00:02
> > > CrawlDb update: starting at 2011-07-17 09:31:32
> > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > CrawlDb update: segments: [/home/llist/nutchData/crawl/segments/20110717093124]
> > > CrawlDb update: additions allowed: true
> > > CrawlDb update: URL normalizing: true
> > > CrawlDb update: URL filtering: true
> > > CrawlDb update: Merging segment data into db.
> > > CrawlDb update: finished at 2011-07-17 09:31:34, elapsed: 00:00:02
> > > :
> > > :
> > > :
> > > :
> > > --------------------------------------------------------------------
> > >
> > > On Sat, 2011-07-16 at 12:14 +1000, Leo Subscriptions wrote:
> > > > Done, but now get additional errors:
> > > >
> > > > -------------------
> > > > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110716105826
> > > > CrawlDb update: starting at 2011-07-16 11:03:56
> > > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > > CrawlDb update: segments: [file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch, file:/home/llist/nutchData/crawl/segments/20110716105826/content, file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse, file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data, file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate, file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text]
> > > > CrawlDb update: additions allowed: true
> > > > CrawlDb update: URL normalizing: false
> > > > CrawlDb update: URL filtering: false
> > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch
> > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/content
> > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse
> > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data
> > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate
> > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text
> > > > CrawlDb update: Merging segment data into db.
> > > > CrawlDb update: finished at 2011-07-16 11:03:57, elapsed: 00:00:01
> > > > -------------------------------------------
> > > >
> > > > On Sat, 2011-07-16 at 02:36 +0200, Markus Jelsma wrote:
> > > > > fetch, then parse.
> > > > >
> > > > > > I'm running nutch 1.3 on 64 bit Ubuntu, following are the commands and
> > > > > > relevant output.
> > > > > >
> > > > > > ----------------------------------
> > > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed
> > > > > > Injector: starting at 2011-07-15 18:32:10
> > > > > > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > > > > > Injector: urlDir: /home/llist/nutchData/seed
> > > > > > Injector: Converting injected urls to crawl db entries.
> > > > > > Injector: Merging injected urls into crawl db.
> > > > > > Injector: finished at 2011-07-15 18:32:13, elapsed: 00:00:02
> > > > > > =================
> > > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments
> > > > > > Generator: starting at 2011-07-15 18:32:41
> > > > > > Generator: Selecting best-scoring urls due for fetch.
> > > > > > Generator: filtering: true
> > > > > > Generator: normalizing: true
> > > > > > Generator: jobtracker is 'local', generating exactly one partition.
> > > > > > Generator: Partitioning selected urls for politeness.
> > > > > > Generator: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > > > > > Generator: finished at 2011-07-15 18:32:45, elapsed: 00:00:03
> > > > > > ==================
> > > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch fetch /home/llist/nutchData/crawl/segments/20110715183244
> > > > > > Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> > > > > > Fetcher: starting at 2011-07-15 18:34:55
> > > > > > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > > > > > Fetcher: threads: 10
> > > > > > QueueFeeder finished: total 1 records + hit by time limit :0
> > > > > > fetching http://www.seek.com.au/
> > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > -finishing thread FetcherThread, activeThreads=2
> > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > > > > -finishing thread FetcherThread, activeThreads=0
> > > > > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > > > > -activeThreads=0
> > > > > > Fetcher: finished at 2011-07-15 18:34:59, elapsed: 00:00:03
> > > > > > =================
> > > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110715183244
> > > > > > CrawlDb update: starting at 2011-07-15 18:36:00
> > > > > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > > > > CrawlDb update: segments: [file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch, file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate, file:/home/llist/nutchData/crawl/segments/20110715183244/content]
> > > > > > CrawlDb update: additions allowed: true
> > > > > > CrawlDb update: URL normalizing: false
> > > > > > CrawlDb update: URL filtering: false
> > > > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch
> > > > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate
> > > > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/content
> > > > > > CrawlDb update: Merging segment data into db.
> > > > > > CrawlDb update: finished at 2011-07-15 18:36:01, elapsed: 00:00:01
> > > > > > -----------------------------------
> > > > > >
> > > > > > Appreciate any hints on what I'm missing.
> >
> > --
> > Lewis

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
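For anyone who lands on this thread later: the one-shot `crawl` run succeeds because it chains inject, generate, fetch, parse, and updatedb itself, whereas the failing transcripts never run parse. A rough sketch of the equivalent per-step loop, using the paths from this thread; this is not the official crawl script, just an illustration. `RUN=echo` makes it a dry run that only prints each command (set RUN to empty to execute for real), and `latest_segment` simply picks the newest timestamp-named directory:

```shell
#!/bin/sh
# Sketch of the per-step loop that the one-shot `crawl` command wraps.
# Paths match this thread; RUN=echo turns every nutch call into a print.
RUN=echo
NUTCH=/usr/share/nutch/runtime/local/bin/nutch
CRAWL=/home/llist/nutchData/crawl
SEEDS=/home/llist/nutchData/seed/urls

# Segment directories are named by timestamp, so the lexically last
# entry is the newest one.
latest_segment() { ls -d "$1"/* 2>/dev/null | sort | tail -n 1; }

$RUN "$NUTCH" inject "$CRAWL/crawldb" "$SEEDS"
for round in 1 2 3; do                              # -depth 3
  $RUN "$NUTCH" generate "$CRAWL/crawldb" "$CRAWL/segments" -topN 5
  SEGMENT=$(latest_segment "$CRAWL/segments")
  $RUN "$NUTCH" fetch "$SEGMENT"
  $RUN "$NUTCH" parse "$SEGMENT"      # the step the failing runs skipped
  $RUN "$NUTCH" updatedb "$CRAWL/crawldb" "$SEGMENT"
done
```

In a dry run SEGMENT stays empty because no segments exist yet; in a real run each generate creates the segment directory that `latest_segment` then picks up for fetch, parse, and updatedb.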


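The "skipping invalid segment" lines also have a second cause besides the missing parse step: `-dir` expects the directory that contains the segments, not a single segment, so Nutch iterated over the segment's subdirectories (crawl_fetch, content, and so on) and rejected each of them. Assuming Nutch 1.x's documented `updatedb` usage, either form below should be accepted (a segment still needs to have been parsed for the update to pick up its links):

```shell
# Pass the segment itself as a positional argument...
/usr/share/nutch/runtime/local/bin/nutch updatedb \
    /home/llist/nutchData/crawl/crawldb \
    /home/llist/nutchData/crawl/segments/20110715183244

# ...or point -dir at the parent directory that holds the segments:
/usr/share/nutch/runtime/local/bin/nutch updatedb \
    /home/llist/nutchData/crawl/crawldb \
    -dir /home/llist/nutchData/crawl/segments
```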