Re: skipping invalid segments nutch 1.3

Leo Subscriptions Thu, 21 Jul 2011 17:40:35 -0700

Hi Lewis,

Following are the things I tried and the relevant source/logs.



1. Ran 'crawl' without the trailing "/" in the URL (http://www.seek.com.au);
result OK.
2. Ran 'crawl' with the trailing "/" in the URL (http://www.seek.com.au/);
result OK.
3. Had a look at regex-urlfilter.txt; the relevant entries are as
follows:

----------- regex-urlfilter.txt -----------------
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
----------------------------------------------------------
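
For readers following along, here is a minimal sketch of how the two rules above behave. It assumes that the regex URL filter checks the rules top to bottom, lets the first matching rule decide, and drops a URL that matches no rule. The helper name `accepts` and the third sample URL are made up for illustration; the patterns are copied from the file excerpt above.

```python
import re

# (sign, pattern) pairs copied from the regex-urlfilter.txt excerpt, in file order.
RULES = [
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),  # break loops on repeated path segments
    ("+", r"."),                             # accept anything else
]

def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in rules:
        # Assumption: a rule applies if the pattern is found anywhere in the URL.
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False

for url in (
    "http://www.seek.com.au",         # seed without the trailing slash
    "http://www.seek.com.au/",        # seed with the trailing slash
    "http://example.com/a/b/a/c/a/",  # hypothetical looping URL
):
    print(url, accepts(url))
```

Under these assumptions both forms of the seed URL pass the filter, and only the looping URL is rejected, which is consistent with 'crawl' working in both cases 1 and 2.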
4. I think you are correct in that fetch does not actually fetch
anything. Following are the relevant sections from the hadoop.log. First
the log when 'crawl' was running and then the log for 'inject, generate,
fetch'. The rest of the log up to the fetch is pretty much identical.
One thing I did notice is that the QueueFeeder returns 10 records for
'crawl' and 1 record for 'fetch'

--------- hadoop.log for 'crawl' -----------

2011-07-22 10:02:27,226 INFO  crawl.Generator - Generator: finished at 2011-07-22 10:02:27, elapsed: 00:00:03
2011-07-22 10:02:27,227 WARN  fetcher.Fetcher - Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
2011-07-22 10:02:27,228 INFO  fetcher.Fetcher - Fetcher: starting at 2011-07-22 10:02:27
2011-07-22 10:02:27,228 INFO  fetcher.Fetcher - Fetcher: segment: /home/llist/nutchData/crawl/segments/20110722100225
2011-07-22 10:02:27,910 INFO  fetcher.Fetcher - Fetcher: threads: 10
2011-07-22 10:02:27,918 INFO  fetcher.Fetcher - QueueFeeder finished: total 10 records + hit by time limit :0
2011-07-22 10:02:27,926 INFO  fetcher.Fetcher - fetching http://www.seek.com.au/sales-jobs
2011-07-22 10:02:27,940 INFO  http.Http - http.proxy.host = null
2011-07-22 10:02:27,940 INFO  http.Http - http.proxy.port = 8080
2011-07-22 10:02:27,940 INFO  http.Http - http.timeout = 10000
2011-07-22 10:02:27,940 INFO  http.Http - http.content.limit = 65536
2011-07-22 10:02:27,940 INFO  http.Http - http.agent = listers spider/Nutch-1.3
2011-07-22 10:02:27,940 INFO  http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2011-07-22 10:02:28,929 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=9
2011-07-22 10:02:29,929 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=9
2011-07-22 10:02:30,930 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:31,930 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:32,931 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:33,931 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:34,932 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:35,091 INFO  fetcher.Fetcher - fetching http://www.seek.com.au/mining-resources-energy-jobs/
2011-07-22 10:02:35,933 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:36,933 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:37,933 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:38,934 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:39,934 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:40,363 INFO  fetcher.Fetcher - fetching http://www.seek.com.au/marketing-communications-jobs/
2011-07-22 10:02:40,934 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7


etc.

-----------------------------------------------------------------------------------------

--------- hadoop.log for 'fetch' -----------
2011-07-22 10:14:37,645 INFO  crawl.Generator - Generator: finished at 2011-07-22 10:14:37, elapsed: 00:00:03
2011-07-22 10:16:46,088 WARN  fetcher.Fetcher - Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
2011-07-22 10:16:46,089 INFO  fetcher.Fetcher - Fetcher: starting at 2011-07-22 10:16:46
2011-07-22 10:16:46,089 INFO  fetcher.Fetcher - Fetcher: segment: /home/llist/nutchData/crawl/segments/20110722101436
2011-07-22 10:16:46,720 INFO  fetcher.Fetcher - Fetcher: threads: 10
2011-07-22 10:16:46,741 INFO  plugin.PluginRepository - Plugins: looking in: /usr/share/nutch/runtime/local/plugins
2011-07-22 10:16:46,746 INFO  fetcher.Fetcher - QueueFeeder finished: total 1 records + hit by time limit :0
2011-07-22 10:16:46,815 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]

---------------------------------------------------------------------------------------
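
As an aside, the QueueFeeder counts in the two logs above can be compared with a short script instead of by eye. This is only a rough sketch: the function name `queue_feeder_totals` is hypothetical, and the regular expression is inferred from the log lines quoted in this message rather than taken from the Nutch source, so it may need adjusting for other versions.

```python
import re

# Pattern inferred from the hadoop.log excerpts in this thread (an assumption).
# \s+ also matches a newline, so entries that a mail client wrapped still match.
QUEUE_FEEDER = re.compile(r"QueueFeeder finished:\s+total\s+(\d+)\s+records")

def queue_feeder_totals(log_text):
    """Return the record count of every 'QueueFeeder finished' entry, in order."""
    return [int(count) for count in QUEUE_FEEDER.findall(log_text)]

# Two entries copied from the logs above: the 'crawl' run, then the 'fetch' run.
sample = (
    "2011-07-22 10:02:27,918 INFO  fetcher.Fetcher - QueueFeeder finished: "
    "total 10 records + hit by time limit :0\n"
    "2011-07-22 10:16:46,746 INFO  fetcher.Fetcher - QueueFeeder finished: "
    "total 1 records + hit by time limit :0\n"
)
print(queue_feeder_totals(sample))  # prints [10, 1]
```

To run it against a real log, read the file contents (for example with `open(path).read()`, where `path` points at your own hadoop.log) and pass the text to the function.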

Cheers,

Leo


On Fri, 2011-07-22 at 09:51 +1000, Leo Subscriptions wrote:

> Hi Lewis,
> 
> Will try your suggestion shortly, but am still puzzled why the crawl
> command works. Isn't it using the same filter, etc?
> 
> Cheers,
> 
> Leo
> 
> On Thu, 2011-07-21 at 20:55 +0100, lewis john mcgibbney wrote:
> 
> > Hi Leo,
> > 
> > From the time both the fetching and parsing took, I suspect that
> > Nutch didn't actually fetch the URL; however, this may not be the
> > case, as I have nothing to benchmark it against. Unfortunately, on this
> > occasion the URL http://wiki.apache.org actually redirects to
> > http://wiki.apache.org/general/ so I'm going to post my log output
> > from the last URL you specified in an attempt to clear this one up. The
> > following confirms that you are accurate with your observations that
> > not only does this produce invalid segments but also nothing is
> > fetched in the process.
> > 
> > Therefore the reason that we are getting the "- skipping invalid
> > segment" message is that we are not actually fetching any content. My
> > initial thought was that your urlfilters were not set properly, and I
> > think that this is part of the cause.
> > 
> > Please follow the syntax very carefully and it will work perfectly for
> > you as follows
> > 
> > regex-urlfilter.txt
> > --------------------------
> > 
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > 
> > # crawl URLs in the following domains.
> > +^http://([a-z0-9]*\.)*seek.com.au/
> > 
> > # accept anything else
> > #+.
> > 
> > seed file
> > ----------------------
> > http://www.seek.com.au
> > 
> > It sounds really trivial, but I think that the trailing '/' in your
> > seed file may have been making all of the difference.
> > 
> > Please try, test with readdb and readseg and comment back.
> > 
> > Sorry for the delayed posts on this one; I have not had much time to
> > get to it. Hope all goes to plan. Evidence can be seen below.
> > 
> > lewis@lewis-01:~/ASF/branch-1.4/runtime/local$ bin/nutch readdb
> > crawldb -stats
> > CrawlDb statistics start: crawldb
> > Statistics for CrawlDb: crawldb
> > TOTAL urls:    48
> > retry 0:    48
> > min score:    0.017
> > avg score:    0.041125
> > max score:    1.175
> > status 1 (db_unfetched):    47
> > status 2 (db_fetched):    1
> > CrawlDb statistics: done
> > 
> > 
> > 
> > 
> > 
> > 
> > On Thu, Jul 21, 2011 at 3:30 AM, Leo Subscriptions
> > <[email protected]> wrote:
> > 
> >         Following are the suggested commands and the results.
> >         I left the redirect as 0, as 'crawl' works without any issues. The
> >         problem only occurs when running the individual commands.
> >         
> >         ------- nutch-site.xml -------------------------------
> >         <?xml version="1.0"?>
> >         <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> >         
> >         <!-- Put site-specific property overrides in this file. -->
> >         
> >         <configuration>
> >         
> >         <property>
> >          <name>http.agent.name</name>
> >          <value>listers spider</value>
> >         </property>
> >         
> >         <property>
> >          <name>fetcher.verbose</name>
> >          <value>true</value>
> >          <description>If true, fetcher will log more
> >         verbosely.</description>
> >         </property>
> >         
> >         <property>
> >          <name>http.verbose</name>
> >          <value>true</value>
> >          <description>If true, HTTP will log more
> >         verbosely.</description>
> >         </property>
> >         
> >         </configuration>
> >         ---------------------------------------------------------------
> >         
> >         ------ Individual commands and results -------------------------
> >         
> >         
> >         llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed/urls
> >         Injector: starting at 2011-07-21 12:24:52
> >         Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> >         Injector: urlDir: /home/llist/nutchData/seed/urls
> >         Injector: Converting injected urls to crawl db entries.
> >         Injector: Merging injected urls into crawl db.
> >         Injector: finished at 2011-07-21 12:24:55, elapsed: 00:00:02
> >         
> >         
> >         
> >         llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments -topN 100
> >         Generator: starting at 2011-07-21 12:25:16
> >         Generator: Selecting best-scoring urls due for fetch.
> >         Generator: filtering: true
> >         Generator: normalizing: true
> >         Generator: topN: 100
> >         Generator: jobtracker is 'local', generating exactly one partition.
> >         Generator: Partitioning selected urls for politeness.
> >         Generator: segment: /home/llist/nutchData/crawl/segments/20110721122519
> >         Generator: finished at 2011-07-21 12:25:20, elapsed: 00:00:03
> >         
> >         
> >         
> >         llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch fetch /home/llist/nutchData/crawl/segments/20110721122519
> >         Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> >         Fetcher: starting at 2011-07-21 12:26:36
> >         Fetcher: segment: /home/llist/nutchData/crawl/segments/20110721122519
> >         Fetcher: threads: 10
> >         QueueFeeder finished: total 1 records + hit by time limit :0
> >         
> >         -finishing thread FetcherThread, activeThreads=1
> >         
> >         
> >         fetching http://wiki.apache.org/
> >         
> >         -finishing thread FetcherThread, activeThreads=1
> >         -finishing thread FetcherThread, activeThreads=1
> >         -finishing thread FetcherThread, activeThreads=1
> >         -finishing thread FetcherThread, activeThreads=1
> >         -finishing thread FetcherThread, activeThreads=1
> >         -finishing thread FetcherThread, activeThreads=1
> >         -finishing thread FetcherThread, activeThreads=1
> >         -finishing thread FetcherThread, activeThreads=1
> >         -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> >         -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> >         -finishing thread FetcherThread, activeThreads=0
> >         -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> >         -activeThreads=0
> >         
> >         
> >         Fetcher: finished at 2011-07-21 12:26:40, elapsed: 00:00:04
> >         
> >         
> >         
> >         llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch parse /home/llist/nutchData/crawl/segments/20110721122519
> >         ParseSegment: starting at 2011-07-21 12:27:22
> >         ParseSegment: segment: /home/llist/nutchData/crawl/segments/20110721122519
> >         ParseSegment: finished at 2011-07-21 12:27:24, elapsed: 00:00:01
> >         
> >         
> >         
> >         llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110721122519
> >         CrawlDb update: starting at 2011-07-21 12:28:03
> >         CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> >         CrawlDb update: segments:
> >         [file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text,
> >         file:/home/llist/nutchData/crawl/segments/20110721122519/content,
> >         file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse,
> >         file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data,
> >         file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch,
> >         file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate]
> >         CrawlDb update: additions allowed: true
> >         CrawlDb update: URL normalizing: false
> >         CrawlDb update: URL filtering: false
> >          - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text
> >          - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/content
> >          - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse
> >          - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data
> >          - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch
> >          - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate
> >         CrawlDb update: Merging segment data into db.
> >         CrawlDb update: finished at 2011-07-21 12:28:04, elapsed: 00:00:01
> >         
> >         
> > ------------------------------------------------------------------------------------
> >         
> >         
> >         
> >         
> >         
> >         On Wed, 2011-07-20 at 21:58 +0100, lewis john mcgibbney wrote:
> >         
> >         > There is no documentation for individual commands used to
> >         run a Nutch 1.3
> >         > crawl so I'm not sure where there has been a mislead. In the
> >         instance that
> >         > this was required I would direct newer users to the legacy
> >         documentation for
> >         > the time being.
> >         >
> >         > My comment to Leo was to understand whether he managed to
> >         correct the
> >         > invalid segments problem.
> >         >
> >         > Leo, if this still persists may I ask you to try again, I
> >         will do the same
> >         > and will be happy to provide feedback
> >         >
> >         > May I suggest the following
> >         >
> >         >
> >         > use the following commands
> >         >
> >         > inject
> >         > generate
> >         > fetch
> >         > parse
> >         > updatedb
> >         >
> >         > At this stage we should be able to ascertain if something is
> >         correct and
> >         > hopefully debug. May I add the following... please make the
> >         following
> >         > additions to nutch-site.
> >         >
> >         > fetcher verbose - true
> >         > http verbose - true
> >         > check for redirects and set accordingly
> >         >
> >         >
> >         > On Wed, Jul 20, 2011 at 1:39 PM, Julien Nioche <
> >         > [email protected]> wrote:
> >         >
> >         > > The wiki can be edited and you are welcome to suggest
> >         improvements if there
> >         > > is something missing
> >         > >
> >         > > On 20 July 2011 13:31, Cam Bazz <[email protected]> wrote:
> >         > >
> >         > > > Hello,
> >         > > >
> >         > > > I think there is a mislead in the documentation, it does
> >         not tell us
> >         > > > that we have to parse.
> >         > > >
> >         > > > On Wed, Jul 20, 2011 at 11:42 AM, Julien Nioche
> >         > > > <[email protected]> wrote:
> >         > > > > Haven't you forgotten to call parse?
> >         > > > >
> >         > > > > On 19 July 2011 23:40, Leo Subscriptions
> >         <[email protected]>
> >         > > > wrote:
> >         > > > >
> >         > > > >> Hi Lewis,
> >         > > > >>
> >         > > > >> You are correct about the last post not showing any
> >         errors. I just
> >         > > > >> wanted to show that I don't get any errors if I use
> >         'crawl' and to
> >         > > prove
> >         > > > >> that I do not have any faults in the conf files or
> >         the directories.
> >         > > > >>
> >         > > > >> I still get the errors if I use the individual
> >         commands inject,
> >         > > > >> generate, fetch....
> >         > > > >>
> >         > > > >> Cheers,
> >         > > > >>
> >         > > > >> Leo
> >         > > > >>
> >         > > > >>
> >         > > > >>
> >         > > > >>  On Tue, 2011-07-19 at 22:09 +0100, lewis john
> >         mcgibbney wrote:
> >         > > > >>
> >         > > > >> > Hi Leo
> >         > > > >> >
> >         > > > >> > Did you resolve?
> >         > > > >> >
> >         > > > >> > Your second log data doesn't appear to show any
> >         errors however the
> >         > > > >> > problem you specify if one I have witnessed myself
> >         while ago. Since
> >         > > > >> > you posted have you been able to replicate... or
> >         resolve?
> >         > > > >> >
> >         > > > >> >
> >         > > > >> > On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions
> >         > > > >> > <[email protected]> wrote:
> >         > > > >> >
> >         > > > >> >         I've used crawl to ensure config is correct
> >         and I don't get
> >         > > > >> >         any errors,
> >         > > > >> >         so I must be doing something wrong with the
> >         individual
> >         > > steps,
> >         > > > >> >         but can;t
> >         > > > >> >         see what.
> >         > > > >> >
> >         > > > >> >
> >         > > > >>
> >         > > >
> >         > >
> >         
> > --------------------------------------------------------------------------------------------------------------------
> >         > > > >> >
> >         > > > >> >         llist@LeosLinux:~/nutchData
> >         > > > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
> >         > > > >> >
> >         > > > >> >
> >         > > > >> >         crawl /home/llist/nutchData/seed/urls
> >         > > > >> >         -dir /home/llist/nutchData/crawl
> >         > > > >> >         -depth 3 -topN 5
> >         > > > >> >         solrUrl is not set, indexing will be
> >         skipped...
> >         > > > >> >         crawl started
> >         in: /home/llist/nutchData/crawl
> >         > > > >> >         rootUrlDir
> >         = /home/llist/nutchData/seed/urls
> >         > > > >> >         threads = 10
> >         > > > >> >         depth = 3
> >         > > > >> >         solrUrl=null
> >         > > > >> >         topN = 5
> >         > > > >> >         Injector: starting at 2011-07-17 09:31:19
> >         > > > >> >
> >         > > > >> >         Injector:
> >         crawlDb: /home/llist/nutchData/crawl/crawldb
> >         > > > >> >
> >         > > > >> >
> >         > > > >> >         Injector:
> >         urlDir: /home/llist/nutchData/seed/urls
> >         > > > >> >
> >         > > > >> >         Injector: Converting injected urls to crawl
> >         db entries.
> >         > > > >> >         Injector: Merging injected urls into crawl
> >         db.
> >         > > > >> >
> >         > > > >> >
> >         > > > >> >         Injector: finished at 2011-07-17 09:31:22,
> >         elapsed: 00:00:02
> >         > > > >> >         Generator: starting at 2011-07-17 09:31:22
> >         > > > >> >
> >         > > > >> >         Generator: Selecting best-scoring urls due
> >         for fetch.
> >         > > > >> >         Generator: filtering: true
> >         > > > >> >         Generator: normalizing: true
> >         > > > >> >
> >         > > > >> >
> >         > > > >> >         Generator: topN: 5
> >         > > > >> >
> >         > > > >> >         Generator: jobtracker is 'local',
> >         generating exactly one
> >         > > > >> >         partition.
> >         > > > >> >         Generator: Partitioning selected urls for
> >         politeness.
> >         > > > >> >
> >         > > > >> >
> >         > > > >> >         Generator:
> >         > > > >> >
> >         segment: /home/llist/nutchData/crawl/segments/20110717093124
> >         > > > >> >         Generator: finished at 2011-07-17 09:31:26,
> >         elapsed:
> >         > > 00:00:04
> >         > > > >> >
> >         > > > >> >         Fetcher: Your 'http.agent.name' value
> >         should be listed
> >         > > first
> >         > > > >> >         in
> >         > > > >> >         'http.robots.agents' property.
> >         > > > >> >
> >         > > > >> >
> >         > > > >> >         Fetcher: starting at 2011-07-17 09:31:26
> >         > > > >> >         Fetcher:
> >         > > > >> >
> >         segment: /home/llist/nutchData/crawl/segments/20110717093124
> >         > > > >> >
> >         > > > >> >         Fetcher: threads: 10
> >         > > > >> >         QueueFeeder finished: total 1 records + hit
> >         by time limit :0
> >         > > > >> >         fetching http://www.seek.com.au/
> >         > > > >> >         -finishing thread FetcherThread,
> >         activeThreads=1
> >         > > > >> >         -finishing thread FetcherThread,
> >         activeThreads=1
> >         > > > >> >         -finishing thread FetcherThread,
> >         activeThreads=1
> >         > > > >> >         -finishing thread FetcherThread,
> >         activeThreads=1
> >         > > > >> >         -finishing thread FetcherThread,
> >         activeThreads=1
> >         > > > >> >         -finishing thread FetcherThread,
> >         activeThreads=1
> >         > > > >> >         -finishing thread FetcherThread,
> >         activeThreads=1
> >         > > > >> >         -finishing thread FetcherThread,
> >         activeThreads=1
> >         > > > >> >
> >         > > > >> >         -finishing thread FetcherThread,
> >         activeThreads=1
> >         > > > >> >         -activeThreads=1, spinWaiting=0,
> >         fetchQueues.totalSize=0
> >         > > > >> >         -finishing thread FetcherThread,
> >         activeThreads=0
> >         > > > >> >         -activeThreads=0, spinWaiting=0,
> >         fetchQueues.totalSize=0
> >         > > > >> >         -activeThreads=0
> >         > > > >> >
> >         > > > >> >
> >         > > > >> >         Fetcher: finished at 2011-07-17 09:31:29,
> >         elapsed: 00:00:03
> >         > > > >> >         ParseSegment: starting at 2011-07-17
> >         09:31:29
> >         > > > >> >         ParseSegment:
> >         > > > >> >
> >         segment: /home/llist/nutchData/crawl/segments/20110717093124
> >         > > > >> >         ParseSegment: finished at 2011-07-17
> >         09:31:32, elapsed:
> >         > > > >> >         00:00:02
> >         > > > >> >         CrawlDb update: starting at 2011-07-17
> >         09:31:32
> >         > > > >> >
> >         > > > >> >         CrawlDb update:
> >         db: /home/llist/nutchData/crawl/crawldb
> >         > > > >> >         CrawlDb update: segments:
> >         > > > >> >
> >         > > > >> >
> >         > > > >> >
> >         [/home/llist/nutchData/crawl/segments/20110717093124]
> >         > > > >> >
> >         > > > >> >         CrawlDb update: additions allowed: true
> >         > > > >> >
> >         > > > >> >
> >         > > > >> >         CrawlDb update: URL normalizing: true
> >         > > > >> >         CrawlDb update: URL filtering: true
> >         > > > >> >
> >         > > > >> >         CrawlDb update: Merging segment data into
> >         db.
> >         > > > >> >
> >         > > > >> >
> >         > > > >> >         CrawlDb update: finished at 2011-07-17
> >         09:31:34, elapsed:
> >         > > > >> >         00:00:02
> >         > > > >> >         :
> >         > > > >> >         :
> >         > > > >> >         :
> >         > > > >> >         :
> >         > > > >> >
> >         > > > >>
> >         > > >
> >         > >
> >         
> > -----------------------------------------------------------------------------------------------
> >         > > > >> >
> >         > > > >> >
> >         > > > >> >
> >         > > > >> >         On Sat, 2011-07-16 at 12:14 +1000, Leo
> >         Subscriptions wrote:
> >         > > > >> >
> >         > > > >> >         > Done, but now get additional errors:
> >         > > > >> >         >
> >         > > > >> >         > -------------------
> >         > > > >> >         > llist@LeosLinux:~/nutchData
> >         > > > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
> >         > > > >> >         >
> >         updatedb /home/llist/nutchData/crawl/crawldb
> >         > > > >> >         >
> >         -dir /home/llist/nutchData/crawl/segments/20110716105826
> >         > > > >> >         > CrawlDb update: starting at 2011-07-16
> >         11:03:56
> >         > > > >> >         > CrawlDb update:
> >         db: /home/llist/nutchData/crawl/crawldb
> >         > > > >> >         > CrawlDb update: segments:
> >         > > > >> >         >
> >         > > > >> >
> >         > > > >>
> >         
> > [file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch,
> >         > > > >> >         >
> >         > > > >> >
> >         > > >
> >         file:/home/llist/nutchData/crawl/segments/20110716105826/content,
> >         > > > >> >         >
> >         > > > >> >
> >         > > > >>
> >         
> > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse,
> >         > > > >> >         >
> >         > > > >> >
> >         > > > >>
> >         file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data,
> >         > > > >> >         >
> >         > > > >> >
> >         > > > >>
> >         > >
> >         
> > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate,
> >         > > > >> >         >
> >         > > > >> >
> >         > > > >>
> >         file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text]
> >         > > > >> >         > CrawlDb update: additions allowed: true
> >         > > > >> >         > CrawlDb update: URL normalizing: false
> >         > > > >> >         > CrawlDb update: URL filtering: false
> >         > > > >> >         >  - skipping invalid segment
> >         > > > >> >         >
> >         > > > >> >
> >         > > > >>
> >         file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch
> >         > > > >> >         >  - skipping invalid segment
> >         > > > >> >         >
> >         > > > >> >
> >         > > >
> >         file:/home/llist/nutchData/crawl/segments/20110716105826/content
> >         > > > >> >         >  - skipping invalid segment
> >         > > > >> >         >
> >         > > > >> >
> >         > > > >>
> >         file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse
> >         > > > >> >         >  - skipping invalid segment
> >         > > > >> >         >
> >         > > > >> >
> >         > > > >>
> >         file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data
> >         > > > >> >         >  - skipping invalid segment
> >         > > > >> >         >
> >         > > > >> >
> >         > > > >>
> >         > >
> >         
> > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate
> >         > > > >> >         >  - skipping invalid segment
> >         > > > >> >         >
> >         > > > >> >
> >         > > > >>
> >         file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text
> >         > > > >> >         > CrawlDb update: Merging segment data into
> >         db.
> >         > > > >> >         > CrawlDb update: finished at 2011-07-16
> >         11:03:57, elapsed:
> >         > > > >> >         00:00:01
> >         > > > >> >         >
> >         -------------------------------------------
> >         > > > >> >         >
> >         > > > >> >         > On Sat, 2011-07-16 at 02:36 +0200, Markus
> >         Jelsma wrote:
> >         > > > >> >         >
> >         > > > >> >         > > fetch, then parse.
> >         > > > >> >         > >
> >         > > > >> >         > > > I'm running nutch 1.3 on 64 bit
> >         Ubuntu, following are
> >         > > > >> >         the commands and
> >         > > > >> >         > > > relevant output.
> >         > > > >> >         > > >
> >         > > > >> >         > > > ----------------------------------
> >         > > > >> >         > > > llist@LeosLinux:~
> >         > > > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
> >         > > > >> >         > > >
> >         > > > >> >         inject /home/llist/nutchData/crawl/crawldb
> >         > > > >> /home/llist/nutchData/seed
> >         > > > >> >         > > > Injector: starting at 2011-07-15
> >         18:32:10
> >         > > > >> >         > > > Injector:
> >         crawlDb: /home/llist/nutchData/crawl/crawldb
> >         > > > >> >         > > > Injector:
> >         urlDir: /home/llist/nutchData/seed
> >         > > > >> >         > > > Injector: Converting injected urls to
> >         crawl db
> >         > > entries.
> >         > > > >> >         > > > Injector: Merging injected urls into
> >         crawl db.
> >         > > > >> >         > > > Injector: finished at 2011-07-15
> >         18:32:13, elapsed:
> >         > > > >> >         00:00:02
> >         > > > >> >         > > > =================
> >         > > > >> >         > > > llist@LeosLinux:~
> >         > > > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
> >         > > > >> >         > > >
> >         generate /home/llist/nutchData/crawl/crawldb
> >         > > > >> >         > > > /home/llist/nutchData/crawl/segments
> >         Generator:
> >         > > starting
> >         > > > >> >         at 2011-07-15
> >         > > > >> >         > > > 18:32:41
> >         > > > >> >         > > > Generator: Selecting best-scoring
> >         urls due for fetch.
> >         > > > >> >         > > > Generator: filtering: true
> >         > > > >> >         > > > Generator: normalizing: true
> >         > > > >> >         > > > Generator: jobtracker is 'local',
> >         generating exactly
> >         > > one
> >         > > > >> >         partition.
> >         > > > >> >         > > > Generator: Partitioning selected urls
> >         for politeness.
> >         > > > >> >         > > > Generator:
> >         > > > >> >
> >         segment: /home/llist/nutchData/crawl/segments/20110715183244
> >         > > > >> >         > > > Generator: finished at 2011-07-15
> >         18:32:45, elapsed:
> >         > > > >> >         00:00:03
> >         > > > >> >         > > > ==================
> >         > > > >> >         > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch fetch /home/llist/nutchData/crawl/segments/20110715183244
> >         > > > >> >         > > > Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> >         > > > >> >         > > > Fetcher: starting at 2011-07-15 18:34:55
> >         > > > >> >         > > > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110715183244
> >         > > > >> >         > > > Fetcher: threads: 10
> >         > > > >> >         > > > QueueFeeder finished: total 1 records + hit by time limit :0
> >         > > > >> >         > > > fetching http://www.seek.com.au/
> >         > > > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> >         > > > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> >         > > > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> >         > > > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> >         > > > >> >         > > > -finishing thread FetcherThread, activeThreads=2
> >         > > > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> >         > > > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> >         > > > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> >         > > > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> >         > > > >> >         > > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> >         > > > >> >         > > > -finishing thread FetcherThread, activeThreads=0
> >         > > > >> >         > > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> >         > > > >> >         > > > -activeThreads=0
> >         > > > >> >         > > > Fetcher: finished at 2011-07-15 18:34:59, elapsed: 00:00:03
> >         > > > >> >         > > > =================
> >         > > > >> >         > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110715183244
> >         > > > >> >         > > > CrawlDb update: starting at 2011-07-15 18:36:00
> >         > > > >> >         > > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> >         > > > >> >         > > > CrawlDb update: segments: [file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch, file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate, file:/home/llist/nutchData/crawl/segments/20110715183244/content]
> >         > > > >> >         > > > CrawlDb update: additions allowed: true
> >         > > > >> >         > > > CrawlDb update: URL normalizing: false
> >         > > > >> >         > > > CrawlDb update: URL filtering: false
> >         > > > >> >         > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch
> >         > > > >> >         > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate
> >         > > > >> >         > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/content
> >         > > > >> >         > > > CrawlDb update: Merging segment data into db.
> >         > > > >> >         > > > CrawlDb update: finished at 2011-07-15 18:36:01, elapsed: 00:00:01
> >         > > > >> >         > > > -----------------------------------
> >         > > > >> >         > > >
> >         > > > >> >         > > > Appreciate any hints on what I'm missing.
> >         > > > >> >         >
> >         > > > >> >         >
> >         > > > >> >
> >         > > > >> >
> >         > > > >> >
> >         > > > >> >
> >         > > > >> >
> >         > > > >> >
> >         > > > >> >
> >         > > > >> > --
> >         > > > >> > Lewis
> >         > > > >> >
> >         > > > >>
> >         > > > >>
> >         > > > >>
> >         > > > >
> >         > > > >
> >         > > > > --
> >         > > > > *
> >         > > > > *Open Source Solutions for Text Engineering
> >         > > > >
> >         > > > > http://digitalpebble.blogspot.com/
> >         > > > > http://www.digitalpebble.com
> >         > > > >
> >         > > >
> >         > >
> >         > >
> >         > >
> >         > > --
> >         > > *
> >         > > *Open Source Solutions for Text Engineering
> >         > >
> >         > > http://digitalpebble.blogspot.com/
> >         > > http://www.digitalpebble.com
> >         > >
> >         >
> >         >
> >         >
> >         
> >         
> >         
> > 
> > 
> > 
> > 
> > -- 
> > Lewis 
> > 
> 
>
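
A note on the "skipping invalid segment" lines in the quoted updatedb run
above. As far as I can tell, updatedb only accepts a segment directory that
contains both crawl_fetch and crawl_parse, and -dir expects the parent
directory that holds the segments. If that is right, passing a single segment
to -dir makes it treat crawl_fetch, crawl_generate and content as if they were
segments themselves. This is my reading of the behaviour, not something
confirmed in this thread. A minimal shell sketch of that check follows;
segment_status is a hypothetical helper of my own and not part of Nutch.

```shell
# Hypothetical helper, not part of Nutch.
# Assumption: a segment counts as valid for updatedb only if it contains
# both a crawl_fetch and a crawl_parse sub-directory.
segment_status() {
    seg=$1
    missing=""
    for sub in crawl_fetch crawl_parse; do
        # Collect the name of every required sub-directory that is absent.
        if [ ! -d "$seg/$sub" ]; then
            missing="$missing $sub"
        fi
    done
    if [ -z "$missing" ]; then
        echo "ok"
    else
        echo "invalid: missing$missing"
    fi
}
```

Called as segment_status /home/llist/nutchData/crawl/segments/20110715183244,
it would report the segment as invalid under that assumption, because the
quoted log lists only crawl_fetch, crawl_generate and content there. Called on
one of those sub-directories, which is what the -dir form above ends up
handing to updatedb, it would report both directories as missing.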
