Hi Lewis,

You are correct that the last post doesn't show any errors. I just wanted to show that I don't get any errors when I use 'crawl', to demonstrate that the conf files and directories are not at fault.
I still get the errors when I run the individual commands (inject, generate, fetch, ...) on their own.
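For reference, this is the sequence of individual steps as I understand it should run, using the same paths as in the logs quoted below. The segment name is just the one from my earlier run, so treat this as a sketch rather than a literal transcript; the parse step between fetch and updatedb is there per Markus's advice further down:

--------------------------------------------------------------------
# inject the seed urls into the crawldb
/usr/share/nutch/runtime/local/bin/nutch inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed

# generate a new segment of urls that are due for fetching
/usr/share/nutch/runtime/local/bin/nutch generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments

# the segment directory created by generate (name differs on every run)
SEGMENT=/home/llist/nutchData/crawl/segments/20110715183244

# fetch the segment, then parse it (parse must run before updatedb)
/usr/share/nutch/runtime/local/bin/nutch fetch $SEGMENT
/usr/share/nutch/runtime/local/bin/nutch parse $SEGMENT

# merge the fetched/parsed segment back into the crawldb
/usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb $SEGMENT
--------------------------------------------------------------------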
Cheers,
Leo

On Tue, 2011-07-19 at 22:09 +0100, lewis john mcgibbney wrote:
> Hi Leo
>
> Did you resolve this? Your second set of log data doesn't appear to show
> any errors; however, the problem you describe is one I have witnessed
> myself a while ago. Since you posted, have you been able to replicate it,
> or resolve it?
>
> On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions
> <[email protected]> wrote:
> > I've used crawl to ensure the config is correct and I don't get any
> > errors, so I must be doing something wrong with the individual steps,
> > but I can't see what.
> >
> > --------------------------------------------------------------------
> > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch crawl /home/llist/nutchData/seed/urls -dir /home/llist/nutchData/crawl -depth 3 -topN 5
> > solrUrl is not set, indexing will be skipped...
> > crawl started in: /home/llist/nutchData/crawl
> > rootUrlDir = /home/llist/nutchData/seed/urls
> > threads = 10
> > depth = 3
> > solrUrl=null
> > topN = 5
> > Injector: starting at 2011-07-17 09:31:19
> > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > Injector: urlDir: /home/llist/nutchData/seed/urls
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: finished at 2011-07-17 09:31:22, elapsed: 00:00:02
> > Generator: starting at 2011-07-17 09:31:22
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: topN: 5
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls for politeness.
> > Generator: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > Generator: finished at 2011-07-17 09:31:26, elapsed: 00:00:04
> > Fetcher: Your 'http.agent.name' value should be listed first in
> > 'http.robots.agents' property.
> > Fetcher: starting at 2011-07-17 09:31:26
> > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > Fetcher: threads: 10
> > QueueFeeder finished: total 1 records + hit by time limit :0
> > fetching http://www.seek.com.au/
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > -finishing thread FetcherThread, activeThreads=0
> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=0
> > Fetcher: finished at 2011-07-17 09:31:29, elapsed: 00:00:03
> > ParseSegment: starting at 2011-07-17 09:31:29
> > ParseSegment: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > ParseSegment: finished at 2011-07-17 09:31:32, elapsed: 00:00:02
> > CrawlDb update: starting at 2011-07-17 09:31:32
> > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > CrawlDb update: segments: [/home/llist/nutchData/crawl/segments/20110717093124]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: true
> > CrawlDb update: URL filtering: true
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: finished at 2011-07-17 09:31:34, elapsed: 00:00:02
> > :
> > :
> > :
> > :
> > --------------------------------------------------------------------
> >
> > On Sat, 2011-07-16 at 12:14 +1000, Leo Subscriptions wrote:
> > > Done, but now I get additional errors:
> > >
> > > -------------------
> > > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110716105826
> > > CrawlDb update: starting at 2011-07-16 11:03:56
> > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > CrawlDb update: segments:
> > > [file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch,
> > > file:/home/llist/nutchData/crawl/segments/20110716105826/content,
> > > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse,
> > > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data,
> > > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate,
> > > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text]
> > > CrawlDb update: additions allowed: true
> > > CrawlDb update: URL normalizing: false
> > > CrawlDb update: URL filtering: false
> > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch
> > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/content
> > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse
> > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data
> > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate
> > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text
> > > CrawlDb update: Merging segment data into db.
> > > CrawlDb update: finished at 2011-07-16 11:03:57, elapsed: 00:00:01
> > > -------------------------------------------
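A thought on the 'skipping invalid segment' lines just above: because -dir was pointed at a single segment, updatedb appears to have treated each of that segment's subdirectories (crawl_fetch, content, crawl_parse, ...) as though it were a segment in its own right, and then skipped them all as invalid. If I'm reading the command usage correctly, -dir expects the parent segments/ directory, and a single segment should be passed as a plain argument, i.e. something like this (paths as in the log above):

--------------------------------------------------------------------
# update from one specific segment, passed as a positional argument
/usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments/20110716105826

# or let -dir pick up every segment under the parent directory
/usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments
--------------------------------------------------------------------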
> > > On Sat, 2011-07-16 at 02:36 +0200, Markus Jelsma wrote:
> > > > fetch, then parse.
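Markus's "fetch, then parse" was the step I had originally skipped: as far as I can tell, parse is what writes crawl_parse, parse_data and parse_text into the segment, and in my first run below the segment only ever contained crawl_fetch, crawl_generate and content. So after every fetch there should be a parse before updatedb, e.g. (segment name taken from the log below):

--------------------------------------------------------------------
/usr/share/nutch/runtime/local/bin/nutch parse /home/llist/nutchData/crawl/segments/20110715183244
--------------------------------------------------------------------

Adding it got me further, though the 'skipping invalid segment' lines remained, which I now suspect was the -dir issue noted above.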
> > > > > I'm running nutch 1.3 on 64-bit Ubuntu; following are the commands
> > > > > and relevant output.
> > > > >
> > > > > ----------------------------------
> > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed
> > > > > Injector: starting at 2011-07-15 18:32:10
> > > > > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > > > > Injector: urlDir: /home/llist/nutchData/seed
> > > > > Injector: Converting injected urls to crawl db entries.
> > > > > Injector: Merging injected urls into crawl db.
> > > > > Injector: finished at 2011-07-15 18:32:13, elapsed: 00:00:02
> > > > > =================
> > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments
> > > > > Generator: starting at 2011-07-15 18:32:41
> > > > > Generator: Selecting best-scoring urls due for fetch.
> > > > > Generator: filtering: true
> > > > > Generator: normalizing: true
> > > > > Generator: jobtracker is 'local', generating exactly one partition.
> > > > > Generator: Partitioning selected urls for politeness.
> > > > > Generator: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > > > > Generator: finished at 2011-07-15 18:32:45, elapsed: 00:00:03
> > > > > ==================
> > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch fetch /home/llist/nutchData/crawl/segments/20110715183244
> > > > > Fetcher: Your 'http.agent.name' value should be listed first in
> > > > > 'http.robots.agents' property.
> > > > > Fetcher: starting at 2011-07-15 18:34:55
> > > > > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > > > > Fetcher: threads: 10
> > > > > QueueFeeder finished: total 1 records + hit by time limit :0
> > > > > fetching http://www.seek.com.au/
> > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > -finishing thread FetcherThread, activeThreads=2
> > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > > > -finishing thread FetcherThread, activeThreads=0
> > > > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > > > -activeThreads=0
> > > > > Fetcher: finished at 2011-07-15 18:34:59, elapsed: 00:00:03
> > > > > =================
> > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110715183244
> > > > > CrawlDb update: starting at 2011-07-15 18:36:00
> > > > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > > > CrawlDb update: segments:
> > > > > [file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch,
> > > > > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate,
> > > > > file:/home/llist/nutchData/crawl/segments/20110715183244/content]
> > > > > CrawlDb update: additions allowed: true
> > > > > CrawlDb update: URL normalizing: false
> > > > > CrawlDb update: URL filtering: false
> > > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch
> > > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate
> > > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/content
> > > > > CrawlDb update: Merging segment data into db.
> > > > > CrawlDb update: finished at 2011-07-15 18:36:01, elapsed: 00:00:01
> > > > > -----------------------------------
> > > > >
> > > > > Appreciate any hints on what I'm missing.
>
> --
> Lewis
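P.S. On the 'http.agent.name' warning that appears in both fetch logs above: as I understand it, the fix is simply to list the same agent name first in the http.robots.agents property in conf/nutch-site.xml. A sketch, with a made-up agent name (use whatever http.agent.name is actually set to):

--------------------------------------------------------------------
<property>
  <name>http.agent.name</name>
  <value>LeoTestCrawler</value>  <!-- made-up example name -->
</property>
<property>
  <name>http.robots.agents</name>
  <value>LeoTestCrawler,*</value>  <!-- same name listed first -->
</property>
--------------------------------------------------------------------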

