Hi Lewis,

I'll try your suggestion shortly, but I'm still puzzled as to why the
crawl command works. Isn't it using the same filters, etc.?

Cheers,

Leo

On Thu, 2011-07-21 at 20:55 +0100, lewis john mcgibbney wrote:

> Hi Leo,
> 
> From the times both the fetching and parsing took, I suspect that
> Nutch didn't actually fetch the URL; however, this may not be the
> case, as I have nothing to benchmark it against. Unfortunately, on
> this occasion the URL http://wiki.apache.org actually redirects to
> http://wiki.apache.org/general/, so I'm going to post my log output
> from the last URL you specified in an attempt to clear this one up.
> The following confirms that your observations are accurate: not only
> does this produce invalid segments, but nothing is fetched in the
> process either.
> 
> Therefore, the reason we are getting the "- skipping invalid
> segment" message is that we are not actually fetching any content. My
> initial thought was that your urlfilters were not set properly, and I
> think that this is part of the problem.
> 
> Please follow the syntax very carefully and it should work for you,
> as follows:
> 
> regex-urlfilter.txt
> --------------------------
> 
> # skip URLs with a slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> 
> # crawl URLs in the following domains.
> +^http://([a-z0-9]*\.)*seek.com.au/
> 
> # accept anything else
> #+.
> 
> seed file
> ----------------------
> http://www.seek.com.au
> 
> It sounds really trivial, but I think that the trailing '/' in your
> seed file may have been making all of the difference.
> 
> Please try, test with readdb and readseg and comment back.
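As a quick sanity check outside Nutch, the first-match-wins behaviour of regex-urlfilter.txt can be sketched in Python. This is only an approximation (Nutch's RegexURLFilter is Java; the find-style matching is an assumption), with the two patterns copied from the file above:

```python
import re

# Rules copied from the regex-urlfilter.txt above; '+' accepts, '-' rejects.
rules = [
    ('-', re.compile(r'.*(/[^/]+)/[^/]+\1/[^/]+\1/')),
    ('+', re.compile(r'^http://([a-z0-9]*\.)*seek.com.au/')),
]

def passes(url):
    """First matching rule wins; a URL matching no rule is dropped."""
    for sign, pattern in rules:
        if pattern.search(url):  # assuming find()-style (substring) matching
            return sign == '+'
    return False

print(passes('http://www.seek.com.au/'))  # True: matches the '+' rule
print(passes('http://www.seek.com.au'))   # False: no trailing '/', no rule matches
print(passes('http://example.com/'))      # False: matches no rule
```

Note how the URL without the trailing '/' falls through both rules, which is consistent with the trailing-slash point above.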
> 
> Sorry for the delayed posts on this one; I have not had much time to
> get to it. Hope all goes to plan. Evidence can be seen below:
> 
> lewis@lewis-01:~/ASF/branch-1.4/runtime/local$ bin/nutch readdb
> crawldb -stats
> CrawlDb statistics start: crawldb
> Statistics for CrawlDb: crawldb
> TOTAL urls:    48
> retry 0:    48
> min score:    0.017
> avg score:    0.041125
> max score:    1.175
> status 1 (db_unfetched):    47
> status 2 (db_fetched):    1
> CrawlDb statistics: done
> 
> 
> 
> 
> 
> 
> On Thu, Jul 21, 2011 at 3:30 AM, Leo Subscriptions
> <[email protected]> wrote:
> 
>         Following are the suggested commands and their results. I left
>         the redirect setting at 0, as 'crawl' works without any issues.
>         The problem only occurs when running the individual commands.
>         
>         ------- nutch-site.xml -------------------------------
>         <?xml version="1.0"?>
>         <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>         
>         <!-- Put site-specific property overrides in this file. -->
>         
>         <configuration>
>         
>         <property>
>          <name>http.agent.name</name>
>          <value>listers spider</value>
>         </property>
>         
>         <property>
>          <name>fetcher.verbose</name>
>          <value>true</value>
>          <description>If true, fetcher will log more
>         verbosely.</description>
>         </property>
>         
>         <property>
>          <name>http.verbose</name>
>          <value>true</value>
>          <description>If true, HTTP will log more
>         verbosely.</description>
>         </property>
>         
>         </configuration>
>         ---------------------------------------------------------------
>         
>         ------ Individual commands and
>         results-------------------------
>         
>         
>         llist@LeosLinux:~/nutchData
>         $ /usr/share/nutch/runtime/local/bin/nutch
>         
>         
>         inject /home/llist/nutchData/crawl/crawldb 
> /home/llist/nutchData/seed/urls
>         Injector: starting at 2011-07-21 12:24:52
>         
>         Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
>         Injector: urlDir: /home/llist/nutchData/seed/urls
>         Injector: Converting injected urls to crawl db entries.
>         Injector: Merging injected urls into crawl db.
>         
>         
>         Injector: finished at 2011-07-21 12:24:55, elapsed: 00:00:02
>         
>         
>         
>         llist@LeosLinux:~/nutchData
>         $ /usr/share/nutch/runtime/local/bin/nutch
>         
>         
>         generate /home/llist/nutchData/crawl/crawldb 
> /home/llist/nutchData/crawl/segments -topN 100
>         Generator: starting at 2011-07-21 12:25:16
>         
>         Generator: Selecting best-scoring urls due for fetch.
>         Generator: filtering: true
>         Generator: normalizing: true
>         
>         
>         Generator: topN: 100
>         
>         Generator: jobtracker is 'local', generating exactly one
>         partition.
>         Generator: Partitioning selected urls for politeness.
>         
>         
>         Generator:
>         segment: /home/llist/nutchData/crawl/segments/20110721122519
>         Generator: finished at 2011-07-21 12:25:20, elapsed: 00:00:03
>         
>         
>         
>         llist@LeosLinux:~/nutchData
>         $ /usr/share/nutch/runtime/local/bin/nutch
>         
>         
>         fetch /home/llist/nutchData/crawl/segments/20110721122519
>         
>         Fetcher: Your 'http.agent.name' value should be listed first
>         in
>         'http.robots.agents' property.
>         
>         
>         Fetcher: starting at 2011-07-21 12:26:36
>         Fetcher:
>         segment: /home/llist/nutchData/crawl/segments/20110721122519
>         
>         Fetcher: threads: 10
>         QueueFeeder finished: total 1 records + hit by time limit :0
>         
>         -finishing thread FetcherThread, activeThreads=1
>         
>         
>         fetching http://wiki.apache.org/
>         
>         -finishing thread FetcherThread, activeThreads=1
>         -finishing thread FetcherThread, activeThreads=1
>         -finishing thread FetcherThread, activeThreads=1
>         -finishing thread FetcherThread, activeThreads=1
>         -finishing thread FetcherThread, activeThreads=1
>         -finishing thread FetcherThread, activeThreads=1
>         -finishing thread FetcherThread, activeThreads=1
>         -finishing thread FetcherThread, activeThreads=1
>         -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>         -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
>         -finishing thread FetcherThread, activeThreads=0
>         -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>         -activeThreads=0
>         
>         
>         Fetcher: finished at 2011-07-21 12:26:40, elapsed: 00:00:04
>         
>         
>         
>         llist@LeosLinux:~/nutchData
>         $ /usr/share/nutch/runtime/local/bin/nutch
>         
>         
>         parse /home/llist/nutchData/crawl/segments/20110721122519
>         ParseSegment: starting at 2011-07-21 12:27:22
>         ParseSegment:
>         segment: /home/llist/nutchData/crawl/segments/20110721122519
>         ParseSegment: finished at 2011-07-21 12:27:24, elapsed:
>         00:00:01
>         
>         
>         
>         llist@LeosLinux:~/nutchData
>         $ /usr/share/nutch/runtime/local/bin/nutch
>         updatedb /home/llist/nutchData/crawl/crawldb
>         
>         
>         -dir /home/llist/nutchData/crawl/segments/20110721122519
>         CrawlDb update: starting at 2011-07-21 12:28:03
>         
>         CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
>         CrawlDb update: segments:
>         
>         
>         [file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text,
>         file:/home/llist/nutchData/crawl/segments/20110721122519/content,
>         file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse,
>         file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data,
>         file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch,
>         
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate]
>         
>         CrawlDb update: additions allowed: true
>         CrawlDb update: URL normalizing: false
>         CrawlDb update: URL filtering: false
>          - skipping invalid segment
>         
>         
>         file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text
>          - skipping invalid segment
>         file:/home/llist/nutchData/crawl/segments/20110721122519/content
>          - skipping invalid segment
>         file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse
>          - skipping invalid segment
>         file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data
>          - skipping invalid segment
>         file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch
>          - skipping invalid segment
>         
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate
>         
>         CrawlDb update: Merging segment data into db.
>         
>         
>         CrawlDb update: finished at 2011-07-21 12:28:04, elapsed:
>         00:00:01
>         
>         
> ------------------------------------------------------------------------------------
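A side note on the segment list in the log above: with `-dir`, Nutch enumerates every child of the given directory as a segment, so pointing `-dir` at a single segment makes the part directories (content, crawl_fetch, ...) show up as "segments", none of which is valid on its own. A minimal sketch of the distinction (directory names are illustrative, and this only demonstrates the directory enumeration, not Nutch itself):

```python
import os
import tempfile

# Recreate the layout from the log: one segment containing six part dirs.
root = tempfile.mkdtemp()
seg = os.path.join(root, 'segments', '20110721122519')
for part in ('content', 'crawl_fetch', 'crawl_generate',
             'crawl_parse', 'parse_data', 'parse_text'):
    os.makedirs(os.path.join(seg, part))

def children(dir_arg):
    """Roughly what '-dir <dir_arg>' would enumerate as segments."""
    return sorted(os.listdir(dir_arg))

print(children(os.path.join(root, 'segments')))
# ['20110721122519']  -- the intended segment
print(children(seg))
# the six part dirs, each then reported as an invalid segment
```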
>         
>         
>         
>         
>         
>         On Wed, 2011-07-20 at 21:58 +0100, lewis john mcgibbney wrote:
>         
>         > There is no documentation for the individual commands used to
>         > run a Nutch 1.3 crawl, so I'm not sure where the documentation
>         > has been misleading. Were this required, I would direct newer
>         > users to the legacy documentation for the time being.
>         >
>         > My comment to Leo was to understand whether he had managed to
>         > correct the invalid segments problem.
>         >
>         > Leo, if this still persists, may I ask you to try again? I
>         > will do the same and will be happy to provide feedback.
>         >
>         > May I suggest the following
>         >
>         >
>         > use the following commands
>         >
>         > inject
>         > generate
>         > fetch
>         > parse
>         > updatedb
>         >
>         > At this stage we should be able to ascertain whether something
>         > is wrong and hopefully debug it. May I add the following:
>         > please make the following additions to nutch-site.
>         >
>         > fetcher verbose - true
>         > http verbose - true
>         > check for redirects and set accordingly
>         >
>         >
>         > On Wed, Jul 20, 2011 at 1:39 PM, Julien Nioche <
>         > [email protected]> wrote:
>         >
>         > > The wiki can be edited and you are welcome to suggest
>         improvements if there
>         > > is something missing
>         > >
>         > > On 20 July 2011 13:31, Cam Bazz <[email protected]> wrote:
>         > >
>         > > > Hello,
>         > > >
>         > > > I think the documentation is misleading: it does not tell
>         > > > us that we have to parse.
>         > > >
>         > > > On Wed, Jul 20, 2011 at 11:42 AM, Julien Nioche
>         > > > <[email protected]> wrote:
>         > > > > Haven't you forgotten to call parse?
>         > > > >
>         > > > > On 19 July 2011 23:40, Leo Subscriptions
>         <[email protected]>
>         > > > wrote:
>         > > > >
>         > > > >> Hi Lewis,
>         > > > >>
>         > > > >> You are correct about the last post not showing any
>         errors. I just
>         > > > >> wanted to show that I don't get any errors if I use
>         'crawl' and to
>         > > prove
>         > > > >> that I do not have any faults in the conf files or
>         the directories.
>         > > > >>
>         > > > >> I still get the errors if I use the individual
>         commands inject,
>         > > > >> generate, fetch....
>         > > > >>
>         > > > >> Cheers,
>         > > > >>
>         > > > >> Leo
>         > > > >>
>         > > > >>
>         > > > >>
>         > > > >>  On Tue, 2011-07-19 at 22:09 +0100, lewis john
>         mcgibbney wrote:
>         > > > >>
>         > > > >> > Hi Leo
>         > > > >> >
>         > > > >> > Did you resolve?
>         > > > >> >
>         > > > >> > Your second log doesn't appear to show any errors;
>         > > > >> > however, the problem you describe is one I have
>         > > > >> > witnessed myself a while ago. Since you posted, have
>         > > > >> > you been able to replicate it... or resolve it?
>         > > > >> >
>         > > > >> >
>         > > > >> > On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions
>         > > > >> > <[email protected]> wrote:
>         > > > >> >
>         > > > >> >         I've used crawl to ensure the config is
>         > > > >> >         correct and I don't get any errors, so I must
>         > > > >> >         be doing something wrong with the individual
>         > > > >> >         steps, but can't see what.
>         > > > >> >
>         > > > >> >
>         > > > >>
>         > > >
>         > >
>         
> --------------------------------------------------------------------------------------------------------------------
>         > > > >> >
>         > > > >> >         llist@LeosLinux:~/nutchData
>         > > > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
>         > > > >> >
>         > > > >> >
>         > > > >> >         crawl /home/llist/nutchData/seed/urls
>         > > > >> >         -dir /home/llist/nutchData/crawl
>         > > > >> >         -depth 3 -topN 5
>         > > > >> >         solrUrl is not set, indexing will be
>         skipped...
>         > > > >> >         crawl started
>         in: /home/llist/nutchData/crawl
>         > > > >> >         rootUrlDir
>         = /home/llist/nutchData/seed/urls
>         > > > >> >         threads = 10
>         > > > >> >         depth = 3
>         > > > >> >         solrUrl=null
>         > > > >> >         topN = 5
>         > > > >> >         Injector: starting at 2011-07-17 09:31:19
>         > > > >> >
>         > > > >> >         Injector:
>         crawlDb: /home/llist/nutchData/crawl/crawldb
>         > > > >> >
>         > > > >> >
>         > > > >> >         Injector:
>         urlDir: /home/llist/nutchData/seed/urls
>         > > > >> >
>         > > > >> >         Injector: Converting injected urls to crawl
>         db entries.
>         > > > >> >         Injector: Merging injected urls into crawl
>         db.
>         > > > >> >
>         > > > >> >
>         > > > >> >         Injector: finished at 2011-07-17 09:31:22,
>         elapsed: 00:00:02
>         > > > >> >         Generator: starting at 2011-07-17 09:31:22
>         > > > >> >
>         > > > >> >         Generator: Selecting best-scoring urls due
>         for fetch.
>         > > > >> >         Generator: filtering: true
>         > > > >> >         Generator: normalizing: true
>         > > > >> >
>         > > > >> >
>         > > > >> >         Generator: topN: 5
>         > > > >> >
>         > > > >> >         Generator: jobtracker is 'local',
>         generating exactly one
>         > > > >> >         partition.
>         > > > >> >         Generator: Partitioning selected urls for
>         politeness.
>         > > > >> >
>         > > > >> >
>         > > > >> >         Generator:
>         > > > >> >
>         segment: /home/llist/nutchData/crawl/segments/20110717093124
>         > > > >> >         Generator: finished at 2011-07-17 09:31:26,
>         elapsed:
>         > > 00:00:04
>         > > > >> >
>         > > > >> >         Fetcher: Your 'http.agent.name' value
>         should be listed
>         > > first
>         > > > >> >         in
>         > > > >> >         'http.robots.agents' property.
>         > > > >> >
>         > > > >> >
>         > > > >> >         Fetcher: starting at 2011-07-17 09:31:26
>         > > > >> >         Fetcher:
>         > > > >> >
>         segment: /home/llist/nutchData/crawl/segments/20110717093124
>         > > > >> >
>         > > > >> >         Fetcher: threads: 10
>         > > > >> >         QueueFeeder finished: total 1 records + hit
>         by time limit :0
>         > > > >> >         fetching http://www.seek.com.au/
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         -activeThreads=1, spinWaiting=0,
>         fetchQueues.totalSize=0
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=0
>         > > > >> >         -activeThreads=0, spinWaiting=0,
>         fetchQueues.totalSize=0
>         > > > >> >         -activeThreads=0
>         > > > >> >
>         > > > >> >
>         > > > >> >         Fetcher: finished at 2011-07-17 09:31:29,
>         elapsed: 00:00:03
>         > > > >> >         ParseSegment: starting at 2011-07-17
>         09:31:29
>         > > > >> >         ParseSegment:
>         > > > >> >
>         segment: /home/llist/nutchData/crawl/segments/20110717093124
>         > > > >> >         ParseSegment: finished at 2011-07-17
>         09:31:32, elapsed:
>         > > > >> >         00:00:02
>         > > > >> >         CrawlDb update: starting at 2011-07-17
>         09:31:32
>         > > > >> >
>         > > > >> >         CrawlDb update:
>         db: /home/llist/nutchData/crawl/crawldb
>         > > > >> >         CrawlDb update: segments:
>         > > > >> >
>         > > > >> >
>         > > > >> >
>         [/home/llist/nutchData/crawl/segments/20110717093124]
>         > > > >> >
>         > > > >> >         CrawlDb update: additions allowed: true
>         > > > >> >
>         > > > >> >
>         > > > >> >         CrawlDb update: URL normalizing: true
>         > > > >> >         CrawlDb update: URL filtering: true
>         > > > >> >
>         > > > >> >         CrawlDb update: Merging segment data into
>         db.
>         > > > >> >
>         > > > >> >
>         > > > >> >         CrawlDb update: finished at 2011-07-17
>         09:31:34, elapsed:
>         > > > >> >         00:00:02
>         > > > >> >         :
>         > > > >> >         :
>         > > > >> >         :
>         > > > >> >         :
>         > > > >> >
>         > > > >>
>         > > >
>         > >
>         
> -----------------------------------------------------------------------------------------------
>         > > > >> >
>         > > > >> >
>         > > > >> >
>         > > > >> >         On Sat, 2011-07-16 at 12:14 +1000, Leo
>         Subscriptions wrote:
>         > > > >> >
>         > > > >> >         > Done, but now get additional errors:
>         > > > >> >         >
>         > > > >> >         > -------------------
>         > > > >> >         > llist@LeosLinux:~/nutchData
>         > > > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
>         > > > >> >         >
>         updatedb /home/llist/nutchData/crawl/crawldb
>         > > > >> >         >
>         -dir /home/llist/nutchData/crawl/segments/20110716105826
>         > > > >> >         > CrawlDb update: starting at 2011-07-16
>         11:03:56
>         > > > >> >         > CrawlDb update:
>         db: /home/llist/nutchData/crawl/crawldb
>         > > > >> >         > CrawlDb update: segments:
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         [file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch,
>         > > > >> >         >
>         > > > >> >
>         > > >
>         file:/home/llist/nutchData/crawl/segments/20110716105826/content,
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse,
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data,
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         > >
>         
> file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate,
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text]
>         > > > >> >         > CrawlDb update: additions allowed: true
>         > > > >> >         > CrawlDb update: URL normalizing: false
>         > > > >> >         > CrawlDb update: URL filtering: false
>         > > > >> >         >  - skipping invalid segment
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch
>         > > > >> >         >  - skipping invalid segment
>         > > > >> >         >
>         > > > >> >
>         > > >
>         file:/home/llist/nutchData/crawl/segments/20110716105826/content
>         > > > >> >         >  - skipping invalid segment
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse
>         > > > >> >         >  - skipping invalid segment
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data
>         > > > >> >         >  - skipping invalid segment
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         > >
>         
> file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate
>         > > > >> >         >  - skipping invalid segment
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text
>         > > > >> >         > CrawlDb update: Merging segment data into
>         db.
>         > > > >> >         > CrawlDb update: finished at 2011-07-16
>         11:03:57, elapsed:
>         > > > >> >         00:00:01
>         > > > >> >         >
>         -------------------------------------------
>         > > > >> >         >
>         > > > >> >         > On Sat, 2011-07-16 at 02:36 +0200, Markus
>         Jelsma wrote:
>         > > > >> >         >
>         > > > >> >         > > fetch, then parse.
>         > > > >> >         > >
>         > > > >> >         > > > I'm running nutch 1.3 on 64 bit
>         Ubuntu, following are
>         > > > >> >         the commands and
>         > > > >> >         > > > relevant output.
>         > > > >> >         > > >
>         > > > >> >         > > > ----------------------------------
>         > > > >> >         > > > llist@LeosLinux:~
>         > > > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
>         > > > >> >         > > >
>         > > > >> >         inject /home/llist/nutchData/crawl/crawldb
>         > > > >> /home/llist/nutchData/seed
>         > > > >> >         > > > Injector: starting at 2011-07-15
>         18:32:10
>         > > > >> >         > > > Injector:
>         crawlDb: /home/llist/nutchData/crawl/crawldb
>         > > > >> >         > > > Injector:
>         urlDir: /home/llist/nutchData/seed
>         > > > >> >         > > > Injector: Converting injected urls to
>         crawl db
>         > > entries.
>         > > > >> >         > > > Injector: Merging injected urls into
>         crawl db.
>         > > > >> >         > > > Injector: finished at 2011-07-15
>         18:32:13, elapsed:
>         > > > >> >         00:00:02
>         > > > >> >         > > > =================
>         > > > >> >         > > > llist@LeosLinux:~
>         > > > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
>         > > > >> >         > > >
>         generate /home/llist/nutchData/crawl/crawldb
>         > > > >> >         > > > /home/llist/nutchData/crawl/segments
>         Generator:
>         > > starting
>         > > > >> >         at 2011-07-15
>         > > > >> >         > > > 18:32:41
>         > > > >> >         > > > Generator: Selecting best-scoring
>         urls due for fetch.
>         > > > >> >         > > > Generator: filtering: true
>         > > > >> >         > > > Generator: normalizing: true
>         > > > >> >         > > > Generator: jobtracker is 'local',
>         generating exactly
>         > > one
>         > > > >> >         partition.
>         > > > >> >         > > > Generator: Partitioning selected urls
>         for politeness.
>         > > > >> >         > > > Generator:
>         > > > >> >
>         segment: /home/llist/nutchData/crawl/segments/20110715183244
>         > > > >> >         > > > Generator: finished at 2011-07-15
>         18:32:45, elapsed:
>         > > > >> >         00:00:03
>         > > > >> >         > > > ==================
>         > > > >> >         > > > llist@LeosLinux:~
>         > > > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
>         > > > >> >         > > >
>         > > > >> >
>         fetch /home/llist/nutchData/crawl/segments/20110715183244
>         > > > >> >         > > > Fetcher: Your 'http.agent.name' value
>         should be
>         > > listed
>         > > > >> >         first in
>         > > > >> >         > > > 'http.robots.agents' property.
>         > > > >> >         > > > Fetcher: starting at 2011-07-15
>         18:34:55
>         > > > >> >         > > > Fetcher:
>         > > > >> >
>         segment: /home/llist/nutchData/crawl/segments/20110715183244
>         > > > >> >         > > > Fetcher: threads: 10
>         > > > >> >         > > > QueueFeeder finished: total 1 records
>         + hit by time
>         > > > >> >         limit :0
>         > > > >> >         > > > fetching http://www.seek.com.au/
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=2
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         > > > -activeThreads=1, spinWaiting=0,
>         > > fetchQueues.totalSize=0
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=0
>         > > > >> >         > > > -activeThreads=0, spinWaiting=0,
>         > > fetchQueues.totalSize=0
>         > > > >> >         > > > -activeThreads=0
>         > > > >> >         > > > Fetcher: finished at 2011-07-15
>         18:34:59, elapsed:
>         > > > >> >         00:00:03
>         > > > >> >         > > > =================
>         > > > >> >         > > > llist@LeosLinux:~
>         > > > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
>         > > > >> >         > > >
>         updatedb /home/llist/nutchData/crawl/crawldb
>         > > > >> >         > > > -dir
>         > > /home/llist/nutchData/crawl/segments/20110715183244
>         > > > >> >         > > > CrawlDb update: starting at
>         2011-07-15 18:36:00
>         > > > >> >         > > > CrawlDb update: db:
>         > > /home/llist/nutchData/crawl/crawldb
>         > > > >> >         > > > CrawlDb update: segments:
>         > > > >> >         > > >
>         > > > >> >
>         > > > >>
>         [file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch,
>         > > > >> >         > > >
>         > > > >> >
>         > > > >>
>         > >
>         
> file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate,
>         > > > >> >         > > >
>         > > > >> >
>         > > >
>         file:/home/llist/nutchData/crawl/segments/20110715183244/content]
>         > > > >> >         > > > CrawlDb update: additions allowed:
>         true
>         > > > >> >         > > > CrawlDb update: URL normalizing:
>         false
>         > > > >> >         > > > CrawlDb update: URL filtering: false
>         > > > >> >         > > > - skipping invalid segment
>         > > > >> >         > > >
>         > > > >> >
>         > > > >>
>         file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch
>         > > > >> >         > > > - skipping invalid segment
>         > > > >> >         > > >
>         > > > >> >
>         > > > >>
>         > >
>         
> file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate
>         > > > >> >         > > > - skipping invalid segment
>         > > > >> >         > > >
>         > > > >> >
>         > > >
>         file:/home/llist/nutchData/crawl/segments/20110715183244/content
>         > > > >> >         > > > CrawlDb update: Merging segment data
>         into db.
>         > > > >> >         > > > CrawlDb update: finished at
>         2011-07-15 18:36:01,
>         > > > >> >         elapsed: 00:00:01
>         > > > >> >         > > > -----------------------------------
>         > > > >> >         > > >
>         > > > >> >         > > > Appreciate any hints on what I'm
>         missing.
>         > > > >> >         >
>         > > > >> >         >
>         > > > >> >
>         > > > >> >
>         > > > >> >
>         > > > >> >
>         > > > >> >
>         > > > >> >
>         > > > >> >
>         > > > >> > --
>         > > > >> > Lewis
>         > > > >> >
>         > > > >>
>         > > > >>
>         > > > >>
>         > > > >
>         > > > >
>         > > > > --
>         > > > > *
>         > > > > *Open Source Solutions for Text Engineering
>         > > > >
>         > > > > http://digitalpebble.blogspot.com/
>         > > > > http://www.digitalpebble.com
>         > > > >
>         > > >
>         > >
>         > >
>         > >
>         > > --
>         > > *
>         > > *Open Source Solutions for Text Engineering
>         > >
>         > > http://digitalpebble.blogspot.com/
>         > > http://www.digitalpebble.com
>         > >
>         >
>         >
>         >
>         
>         
>         
> 
> 
> 
> 
> -- 
> Lewis 
> 

