Following are the suggested commands and their results. As suggested, I left the redirect setting at 0, since 'crawl' works without any issues. The problem only occurs when running the individual commands.
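For easier reproduction, here is the same sequence condensed into a single script. This is only a sketch of what I am running (paths exactly as in the transcripts below); the one thing I am not sure about is the last step, i.e. whether -dir should point at the segments parent directory or at the individual segment, so I have left it exactly as I ran it:

#!/bin/sh
# Sketch of the individual-command sequence (same paths as the transcripts below).
NUTCH=/usr/share/nutch/runtime/local/bin/nutch
BASE=/home/llist/nutchData

$NUTCH inject $BASE/crawl/crawldb $BASE/seed/urls
$NUTCH generate $BASE/crawl/crawldb $BASE/crawl/segments -topN 100

# the segment name below is the one generate printed; it changes on every run
SEGMENT=$BASE/crawl/segments/20110721122519

$NUTCH fetch $SEGMENT
$NUTCH parse $SEGMENT

# unsure whether -dir should be given the parent 'segments' directory instead of
# the individual segment; this is exactly what the transcript below shows
$NUTCH updatedb $BASE/crawl/crawldb -dir $SEGMENT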
------- nutch-site.xml -------------------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>listers spider</value>
  </property>
  <property>
    <name>fetcher.verbose</name>
    <value>true</value>
    <description>If true, fetcher will log more verbosely.</description>
  </property>
  <property>
    <name>http.verbose</name>
    <value>true</value>
    <description>If true, HTTP will log more verbosely.</description>
  </property>
</configuration>
---------------------------------------------------------------
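Side note: the fetch step below still prints "Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property." If that matters, my understanding (untested, based on the property description shipped in nutch-default.xml) is that an extra override along these lines would satisfy it:

  <property>
    <name>http.robots.agents</name>
    <value>listers spider,*</value>
    <description>Agents checked against robots.txt, comma separated;
    the http.agent.name value first, '*' last.</description>
  </property>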
------ Individual commands and results -------------------------
llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed/urls
Injector: starting at 2011-07-21 12:24:52
Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
Injector: urlDir: /home/llist/nutchData/seed/urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-21 12:24:55, elapsed: 00:00:02

llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments -topN 100
Generator: starting at 2011-07-21 12:25:16
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: /home/llist/nutchData/crawl/segments/20110721122519
Generator: finished at 2011-07-21 12:25:20, elapsed: 00:00:03

llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch fetch /home/llist/nutchData/crawl/segments/20110721122519
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-07-21 12:26:36
Fetcher: segment: /home/llist/nutchData/crawl/segments/20110721122519
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
-finishing thread FetcherThread, activeThreads=1
fetching http://wiki.apache.org/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-07-21 12:26:40, elapsed: 00:00:04

llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch parse /home/llist/nutchData/crawl/segments/20110721122519
ParseSegment: starting at 2011-07-21 12:27:22
ParseSegment: segment: /home/llist/nutchData/crawl/segments/20110721122519
ParseSegment: finished at 2011-07-21 12:27:24, elapsed: 00:00:01

llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110721122519
CrawlDb update: starting at 2011-07-21 12:28:03
CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
CrawlDb update: segments: [file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text, file:/home/llist/nutchData/crawl/segments/20110721122519/content, file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse, file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data, file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch, file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
 - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text
 - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/content
 - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse
 - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data
 - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch
 - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-07-21 12:28:04, elapsed: 00:00:01
------------------------------------------------------------------------------------

On Wed, 2011-07-20 at 21:58 +0100, lewis john mcgibbney wrote:
> There is no documentation for individual commands used to run a Nutch 1.3 crawl so I'm not sure where there has been a mislead. In the instance that this was required I would direct newer users to the legacy documentation for the time being.
>
> My comment to Leo was to understand whether he managed to correct the invalid segments problem.
>
> Leo, if this still persists may I ask you to try again, I will do the same and will be happy to provide feedback
>
> May I suggest the following
>
> use the following commands
>
> inject
> generate
> fetch
> parse
> updatedb
>
> At this stage we should be able to ascertain if something is correct and hopefully debug. May I add the following... please make the following additions to nutch-site.
>
> fetcher verbose - true
> http verbose - true
> check for redirects and set accordingly
>
> On Wed, Jul 20, 2011 at 1:39 PM, Julien Nioche <[email protected]> wrote:
>
> > The wiki can be edited and you are welcome to suggest improvements if there is something missing
> >
> > On 20 July 2011 13:31, Cam Bazz <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I think there is a mislead in the documentation, it does not tell us that we have to parse.
> > >
> > > On Wed, Jul 20, 2011 at 11:42 AM, Julien Nioche <[email protected]> wrote:
> > > > Haven't you forgotten to call parse?
> > > >
> > > > On 19 July 2011 23:40, Leo Subscriptions <[email protected]> wrote:
> > > >
> > > >> Hi Lewis,
> > > >>
> > > >> You are correct about the last post not showing any errors. I just wanted to show that I don't get any errors if I use 'crawl' and to prove that I do not have any faults in the conf files or the directories.
> > > >>
> > > >> I still get the errors if I use the individual commands inject, generate, fetch....
> > > >>
> > > >> Cheers,
> > > >>
> > > >> Leo
> > > >>
> > > >> On Tue, 2011-07-19 at 22:09 +0100, lewis john mcgibbney wrote:
> > > >>
> > > >> > Hi Leo
> > > >> >
> > > >> > Did you resolve?
> > > >> >
> > > >> > Your second log data doesn't appear to show any errors however the problem you specify if one I have witnessed myself while ago. Since you posted have you been able to replicate... or resolve?
> > > >> >
> > > >> > On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions <[email protected]> wrote:
> > > >> >
> > > >> > I've used crawl to ensure config is correct and I don't get any errors, so I must be doing something wrong with the individual steps, but can;t see what.
> > > >> >
> > > >> > --------------------------------------------------------------------------------------------------------------------
> > > >> > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch crawl /home/llist/nutchData/seed/urls -dir /home/llist/nutchData/crawl -depth 3 -topN 5
> > > >> > solrUrl is not set, indexing will be skipped...
> > > >> > crawl started in: /home/llist/nutchData/crawl
> > > >> > rootUrlDir = /home/llist/nutchData/seed/urls
> > > >> > threads = 10
> > > >> > depth = 3
> > > >> > solrUrl=null
> > > >> > topN = 5
> > > >> > Injector: starting at 2011-07-17 09:31:19
> > > >> > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > > >> > Injector: urlDir: /home/llist/nutchData/seed/urls
> > > >> > Injector: Converting injected urls to crawl db entries.
> > > >> > Injector: Merging injected urls into crawl db.
> > > >> > Injector: finished at 2011-07-17 09:31:22, elapsed: 00:00:02
> > > >> > Generator: starting at 2011-07-17 09:31:22
> > > >> > Generator: Selecting best-scoring urls due for fetch.
> > > >> > Generator: filtering: true
> > > >> > Generator: normalizing: true
> > > >> > Generator: topN: 5
> > > >> > Generator: jobtracker is 'local', generating exactly one partition.
> > > >> > Generator: Partitioning selected urls for politeness.
> > > >> > Generator: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > >> > Generator: finished at 2011-07-17 09:31:26, elapsed: 00:00:04
> > > >> > Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> > > >> > Fetcher: starting at 2011-07-17 09:31:26
> > > >> > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > >> > Fetcher: threads: 10
> > > >> > QueueFeeder finished: total 1 records + hit by time limit :0
> > > >> > fetching http://www.seek.com.au/
> > > >> > -finishing thread FetcherThread, activeThreads=1
> > > >> > -finishing thread FetcherThread, activeThreads=1
> > > >> > -finishing thread FetcherThread, activeThreads=1
> > > >> > -finishing thread FetcherThread, activeThreads=1
> > > >> > -finishing thread FetcherThread, activeThreads=1
> > > >> > -finishing thread FetcherThread, activeThreads=1
> > > >> > -finishing thread FetcherThread, activeThreads=1
> > > >> > -finishing thread FetcherThread, activeThreads=1
> > > >> > -finishing thread FetcherThread, activeThreads=1
> > > >> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > >> > -finishing thread FetcherThread, activeThreads=0
> > > >> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > >> > -activeThreads=0
> > > >> > Fetcher: finished at 2011-07-17 09:31:29, elapsed: 00:00:03
> > > >> > ParseSegment: starting at 2011-07-17 09:31:29
> > > >> > ParseSegment: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > >> > ParseSegment: finished at 2011-07-17 09:31:32, elapsed: 00:00:02
> > > >> > CrawlDb update: starting at 2011-07-17 09:31:32
> > > >> > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > >> > CrawlDb update: segments: [/home/llist/nutchData/crawl/segments/20110717093124]
> > > >> > CrawlDb update: additions allowed: true
> > > >> > CrawlDb update: URL normalizing: true
> > > >> > CrawlDb update: URL filtering: true
> > > >> > CrawlDb update: Merging segment data into db.
> > > >> > CrawlDb update: finished at 2011-07-17 09:31:34, elapsed: 00:00:02
> > > >> > :
> > > >> > :
> > > >> > :
> > > >> > :
> > > >> > -----------------------------------------------------------------------------------------------
> > > >> >
> > > >> > On Sat, 2011-07-16 at 12:14 +1000, Leo Subscriptions wrote:
> > > >> >
> > > >> > > Done, but now get additional errors:
> > > >> > >
> > > >> > > -------------------
> > > >> > > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110716105826
> > > >> > > CrawlDb update: starting at 2011-07-16 11:03:56
> > > >> > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > >> > > CrawlDb update: segments: [file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch, file:/home/llist/nutchData/crawl/segments/20110716105826/content, file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse, file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data, file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate, file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text]
> > > >> > > CrawlDb update: additions allowed: true
> > > >> > > CrawlDb update: URL normalizing: false
> > > >> > > CrawlDb update: URL filtering: false
> > > >> > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch
> > > >> > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/content
> > > >> > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse
> > > >> > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data
> > > >> > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate
> > > >> > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text
> > > >> > > CrawlDb update: Merging segment data into db.
> > > >> > > CrawlDb update: finished at 2011-07-16 11:03:57, elapsed: 00:00:01
> > > >> > > -------------------------------------------
> > > >> > >
> > > >> > > On Sat, 2011-07-16 at 02:36 +0200, Markus Jelsma wrote:
> > > >> > >
> > > >> > > > fetch, then parse.
> > > >> > > >
> > > >> > > > > I'm running nutch 1.3 on 64 bit Ubuntu, following are the commands and relevant output.
> > > >> > > > >
> > > >> > > > > ----------------------------------
> > > >> > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed
> > > >> > > > > Injector: starting at 2011-07-15 18:32:10
> > > >> > > > > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > > >> > > > > Injector: urlDir: /home/llist/nutchData/seed
> > > >> > > > > Injector: Converting injected urls to crawl db entries.
> > > >> > > > > Injector: Merging injected urls into crawl db.
> > > >> > > > > Injector: finished at 2011-07-15 18:32:13, elapsed: 00:00:02
> > > >> > > > > =================
> > > >> > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments
> > > >> > > > > Generator: starting at 2011-07-15 18:32:41
> > > >> > > > > Generator: Selecting best-scoring urls due for fetch.
> > > >> > > > > Generator: filtering: true
> > > >> > > > > Generator: normalizing: true
> > > >> > > > > Generator: jobtracker is 'local', generating exactly one partition.
> > > >> > > > > Generator: Partitioning selected urls for politeness.
> > > >> > > > > Generator: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > > >> > > > > Generator: finished at 2011-07-15 18:32:45, elapsed: 00:00:03
> > > >> > > > > ==================
> > > >> > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch fetch /home/llist/nutchData/crawl/segments/20110715183244
> > > >> > > > > Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> > > >> > > > > Fetcher: starting at 2011-07-15 18:34:55
> > > >> > > > > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > > >> > > > > Fetcher: threads: 10
> > > >> > > > > QueueFeeder finished: total 1 records + hit by time limit :0
> > > >> > > > > fetching http://www.seek.com.au/
> > > >> > > > > -finishing thread FetcherThread, activeThreads=1
> > > >> > > > > -finishing thread FetcherThread, activeThreads=1
> > > >> > > > > -finishing thread FetcherThread, activeThreads=1
> > > >> > > > > -finishing thread FetcherThread, activeThreads=1
> > > >> > > > > -finishing thread FetcherThread, activeThreads=2
> > > >> > > > > -finishing thread FetcherThread, activeThreads=1
> > > >> > > > > -finishing thread FetcherThread, activeThreads=1
> > > >> > > > > -finishing thread FetcherThread, activeThreads=1
> > > >> > > > > -finishing thread FetcherThread, activeThreads=1
> > > >> > > > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > >> > > > > -finishing thread FetcherThread, activeThreads=0
> > > >> > > > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > >> > > > > -activeThreads=0
> > > >> > > > > Fetcher: finished at 2011-07-15 18:34:59, elapsed: 00:00:03
> > > >> > > > > =================
> > > >> > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110715183244
> > > >> > > > > CrawlDb update: starting at 2011-07-15 18:36:00
> > > >> > > > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > >> > > > > CrawlDb update: segments: [file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch, file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate, file:/home/llist/nutchData/crawl/segments/20110715183244/content]
> > > >> > > > > CrawlDb update: additions allowed: true
> > > >> > > > > CrawlDb update: URL normalizing: false
> > > >> > > > > CrawlDb update: URL filtering: false
> > > >> > > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch
> > > >> > > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate
> > > >> > > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/content
> > > >> > > > > CrawlDb update: Merging segment data into db.
> > > >> > > > > CrawlDb update: finished at 2011-07-15 18:36:01, elapsed: 00:00:01
> > > >> > > > > -----------------------------------
> > > >> > > > >
> > > >> > > > > Appreciate any hints on what I'm missing.
> > > >> >
> > > >> > --
> > > >> > Lewis
> > > >>
> > > >
> > > > --
> > > > *
> > > > *Open Source Solutions for Text Engineering
> > > >
> > > > http://digitalpebble.blogspot.com/
> > > > http://www.digitalpebble.com
> >
> > --
> > *
> > *Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
>

