Hello, I think the documentation is misleading: it does not tell us that we have to run parse.
On Wed, Jul 20, 2011 at 11:42 AM, Julien Nioche <[email protected]> wrote:

> Haven't you forgotten to call parse?
>
> On 19 July 2011 23:40, Leo Subscriptions <[email protected]> wrote:
>
> > Hi Lewis,
> >
> > You are correct about the last post not showing any errors. I just
> > wanted to show that I don't get any errors if I use 'crawl' and to prove
> > that I do not have any faults in the conf files or the directories.
> >
> > I still get the errors if I use the individual commands inject,
> > generate, fetch....
> >
> > Cheers,
> >
> > Leo
> >
> > On Tue, 2011-07-19 at 22:09 +0100, lewis john mcgibbney wrote:
> >
> > > Hi Leo,
> > >
> > > Did you resolve this?
> > >
> > > Your second set of log data doesn't appear to show any errors; however,
> > > the problem you specify is one I have witnessed myself a while ago.
> > > Since you posted, have you been able to replicate it... or resolve it?
> > >
> > > On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions
> > > <[email protected]> wrote:
> > >
> > > > I've used crawl to ensure the config is correct and I don't get any
> > > > errors, so I must be doing something wrong with the individual steps,
> > > > but can't see what.
> > > >
> > > > --------------------------------------------------------------------
> > > > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> > > > crawl /home/llist/nutchData/seed/urls -dir /home/llist/nutchData/crawl
> > > > -depth 3 -topN 5
> > > > solrUrl is not set, indexing will be skipped...
> > > > crawl started in: /home/llist/nutchData/crawl
> > > > rootUrlDir = /home/llist/nutchData/seed/urls
> > > > threads = 10
> > > > depth = 3
> > > > solrUrl=null
> > > > topN = 5
> > > > Injector: starting at 2011-07-17 09:31:19
> > > > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > > > Injector: urlDir: /home/llist/nutchData/seed/urls
> > > > Injector: Converting injected urls to crawl db entries.
> > > > Injector: Merging injected urls into crawl db.
> > > > Injector: finished at 2011-07-17 09:31:22, elapsed: 00:00:02
> > > > Generator: starting at 2011-07-17 09:31:22
> > > > Generator: Selecting best-scoring urls due for fetch.
> > > > Generator: filtering: true
> > > > Generator: normalizing: true
> > > > Generator: topN: 5
> > > > Generator: jobtracker is 'local', generating exactly one partition.
> > > > Generator: Partitioning selected urls for politeness.
> > > > Generator: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > > Generator: finished at 2011-07-17 09:31:26, elapsed: 00:00:04
> > > > Fetcher: Your 'http.agent.name' value should be listed first in
> > > > 'http.robots.agents' property.
> > > > Fetcher: starting at 2011-07-17 09:31:26
> > > > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > > Fetcher: threads: 10
> > > > QueueFeeder finished: total 1 records + hit by time limit :0
> > > > fetching http://www.seek.com.au/
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > > -finishing thread FetcherThread, activeThreads=0
> > > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > > -activeThreads=0
> > > > Fetcher: finished at 2011-07-17 09:31:29, elapsed: 00:00:03
> > > > ParseSegment: starting at 2011-07-17 09:31:29
> > > > ParseSegment: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > > ParseSegment: finished at 2011-07-17 09:31:32, elapsed: 00:00:02
> > > > CrawlDb update: starting at 2011-07-17 09:31:32
> > > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > > CrawlDb update: segments: [/home/llist/nutchData/crawl/segments/20110717093124]
> > > > CrawlDb update: additions allowed: true
> > > > CrawlDb update: URL normalizing: true
> > > > CrawlDb update: URL filtering: true
> > > > CrawlDb update: Merging segment data into db.
> > > > CrawlDb update: finished at 2011-07-17 09:31:34, elapsed: 00:00:02
> > > > :
> > > > :
> > > > --------------------------------------------------------------------
> > > >
> > > > On Sat, 2011-07-16 at 12:14 +1000, Leo Subscriptions wrote:
> > > >
> > > > > Done, but now I get additional errors:
> > > > >
> > > > > -------------------
> > > > > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> > > > > updatedb /home/llist/nutchData/crawl/crawldb
> > > > > -dir /home/llist/nutchData/crawl/segments/20110716105826
> > > > > CrawlDb update: starting at 2011-07-16 11:03:56
> > > > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > > > CrawlDb update: segments:
> > > > > [file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch,
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/content,
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse,
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data,
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate,
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text]
> > > > > CrawlDb update: additions allowed: true
> > > > > CrawlDb update: URL normalizing: false
> > > > > CrawlDb update: URL filtering: false
> > > > > - skipping invalid segment
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch
> > > > > - skipping invalid segment
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/content
> > > > > - skipping invalid segment
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse
> > > > > - skipping invalid segment
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data
> > > > > - skipping invalid segment
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate
> > > > > - skipping invalid segment
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text
> > > > > CrawlDb update: Merging segment data into db.
> > > > > CrawlDb update: finished at 2011-07-16 11:03:57, elapsed: 00:00:01
> > > > > -------------------------------------------
> > > > >
> > > > > On Sat, 2011-07-16 at 02:36 +0200, Markus Jelsma wrote:
> > > > >
> > > > > > fetch, then parse.
> > > > > >
> > > > > > > I'm running nutch 1.3 on 64-bit Ubuntu; following are the
> > > > > > > commands and relevant output.
> > > > > > >
> > > > > > > ----------------------------------
> > > > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > > > > inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed
> > > > > > > Injector: starting at 2011-07-15 18:32:10
> > > > > > > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > > > > > > Injector: urlDir: /home/llist/nutchData/seed
> > > > > > > Injector: Converting injected urls to crawl db entries.
> > > > > > > Injector: Merging injected urls into crawl db.
> > > > > > > Injector: finished at 2011-07-15 18:32:13, elapsed: 00:00:02
> > > > > > > =================
> > > > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > > > > generate /home/llist/nutchData/crawl/crawldb
> > > > > > > /home/llist/nutchData/crawl/segments
> > > > > > > Generator: starting at 2011-07-15 18:32:41
> > > > > > > Generator: Selecting best-scoring urls due for fetch.
> > > > > > > Generator: filtering: true
> > > > > > > Generator: normalizing: true
> > > > > > > Generator: jobtracker is 'local', generating exactly one partition.
> > > > > > > Generator: Partitioning selected urls for politeness.
> > > > > > > Generator: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > > > > > > Generator: finished at 2011-07-15 18:32:45, elapsed: 00:00:03
> > > > > > > ==================
> > > > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > > > > fetch /home/llist/nutchData/crawl/segments/20110715183244
> > > > > > > Fetcher: Your 'http.agent.name' value should be listed first in
> > > > > > > 'http.robots.agents' property.
> > > > > > > Fetcher: starting at 2011-07-15 18:34:55
> > > > > > > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > > > > > > Fetcher: threads: 10
> > > > > > > QueueFeeder finished: total 1 records + hit by time limit :0
> > > > > > > fetching http://www.seek.com.au/
> > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > -finishing thread FetcherThread, activeThreads=2
> > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > > > > > -finishing thread FetcherThread, activeThreads=0
> > > > > > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > > > > > -activeThreads=0
> > > > > > > Fetcher: finished at 2011-07-15 18:34:59, elapsed: 00:00:03
> > > > > > > =================
> > > > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > > > > updatedb /home/llist/nutchData/crawl/crawldb
> > > > > > > -dir /home/llist/nutchData/crawl/segments/20110715183244
> > > > > > > CrawlDb update: starting at 2011-07-15 18:36:00
> > > > > > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > > > > > CrawlDb update: segments:
> > > > > > > [file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch,
> > > > > > > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate,
> > > > > > > file:/home/llist/nutchData/crawl/segments/20110715183244/content]
> > > > > > > CrawlDb update: additions allowed: true
> > > > > > > CrawlDb update: URL normalizing: false
> > > > > > > CrawlDb update: URL filtering: false
> > > > > > > - skipping invalid segment
> > > > > > > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch
> > > > > > > - skipping invalid segment
> > > > > > > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate
> > > > > > > - skipping invalid segment
> > > > > > > file:/home/llist/nutchData/crawl/segments/20110715183244/content
> > > > > > > CrawlDb update: Merging segment data into db.
> > > > > > > CrawlDb update: finished at 2011-07-15 18:36:01, elapsed: 00:00:01
> > > > > > > -----------------------------------
> > > > > > >
> > > > > > > Appreciate any hints on what I'm missing.
> > >
> > > --
> > > Lewis
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
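For anyone hitting the same "skipping invalid segment" messages: two things appear to be going on in the logs above. First, `parse` was never run between `fetch` and `updatedb`, so the segment lacks the parse output (`crawl_parse`, `parse_data`, `parse_text`) that `updatedb` expects. Second, `updatedb` was given `-dir <one segment>`, which makes Nutch treat each of that segment's subdirectories (`crawl_fetch`, `content`, ...) as a segment and reject them. A sketch of the step-by-step cycle for Nutch 1.3 (local runtime), using the paths from this thread; the `-topN 5` value and the newest-segment selection line are illustrative assumptions, not something prescribed by the thread:

```shell
# Hypothetical paths, taken from the thread above -- adjust to your layout.
NUTCH=/usr/share/nutch/runtime/local/bin/nutch
CRAWLDB=/home/llist/nutchData/crawl/crawldb
SEEDS=/home/llist/nutchData/seed/urls
SEGMENTS=/home/llist/nutchData/crawl/segments

$NUTCH inject   "$CRAWLDB" "$SEEDS"
$NUTCH generate "$CRAWLDB" "$SEGMENTS" -topN 5

# Pick up the segment the generator just created (newest timestamped dir).
SEGMENT=$SEGMENTS/$(ls "$SEGMENTS" | sort | tail -1)

$NUTCH fetch    "$SEGMENT"
$NUTCH parse    "$SEGMENT"              # the missing step
$NUTCH updatedb "$CRAWLDB" "$SEGMENT"   # pass the segment itself, not -dir <segment>
```

The `-dir` option is for pointing `updatedb` at the parent `segments` directory so it merges every segment under it; a single segment is passed as a plain argument, as in the successful `crawl` log above.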

