Hi Leo

Did you resolve this?

Your second log doesn't appear to show any errors; however, the problem
you describe is one I have witnessed myself a while ago. Since you posted,
have you been able to replicate it... or resolve it?
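
If I remember right, the cause in my case was the -dir flag: updatedb's -dir
option expects the parent segments directory and treats every subdirectory of
the path you give it as a segment. Pointing it at a single segment makes Nutch
try crawl_fetch, content, etc. as segments, which matches the "skipping
invalid segment" lines in your earlier logs. A sketch of the two invocations
that should work (paths copied from your logs, so adjust as needed):

/usr/share/nutch/runtime/local/bin/nutch updatedb \
    /home/llist/nutchData/crawl/crawldb \
    /home/llist/nutchData/crawl/segments/20110716105826

or, to update from all segments under the parent directory at once:

/usr/share/nutch/runtime/local/bin/nutch updatedb \
    /home/llist/nutchData/crawl/crawldb \
    -dir /home/llist/nutchData/crawl/segments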

On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions <[email protected]> wrote:

> I've used the crawl command to ensure the config is correct and I don't get
> any errors, so I must be doing something wrong with the individual steps,
> but I can't see what.
>
>
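
For what it's worth, as far as I recall the crawl command simply chains the
individual tools once per depth iteration, roughly like this (the <...> parts
are placeholders, not literal syntax):

/usr/share/nutch/runtime/local/bin/nutch inject <crawldb> <urlDir>
/usr/share/nutch/runtime/local/bin/nutch generate <crawldb> <segmentsDir> -topN 5
/usr/share/nutch/runtime/local/bin/nutch fetch <segment>
/usr/share/nutch/runtime/local/bin/nutch parse <segment>
/usr/share/nutch/runtime/local/bin/nutch updatedb <crawldb> <segment>

so when crawl works and the manual steps don't, it's usually the arguments to
one of these that differ.
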
> --------------------------------------------------------------------------------------------------------------------
> llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> crawl /home/llist/nutchData/seed/urls -dir /home/llist/nutchData/crawl
> -depth 3 -topN 5
> solrUrl is not set, indexing will be skipped...
> crawl started in: /home/llist/nutchData/crawl
> rootUrlDir = /home/llist/nutchData/seed/urls
> threads = 10
> depth = 3
> solrUrl=null
> topN = 5
> Injector: starting at 2011-07-17 09:31:19
> Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> Injector: urlDir: /home/llist/nutchData/seed/urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2011-07-17 09:31:22, elapsed: 00:00:02
> Generator: starting at 2011-07-17 09:31:22
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: /home/llist/nutchData/crawl/segments/20110717093124
> Generator: finished at 2011-07-17 09:31:26, elapsed: 00:00:04
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2011-07-17 09:31:26
> Fetcher: segment: /home/llist/nutchData/crawl/segments/20110717093124
> Fetcher: threads: 10
> QueueFeeder finished: total 1 records + hit by time limit :0
> fetching http://www.seek.com.au/
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-07-17 09:31:29, elapsed: 00:00:03
> ParseSegment: starting at 2011-07-17 09:31:29
> ParseSegment: segment: /home/llist/nutchData/crawl/segments/20110717093124
> ParseSegment: finished at 2011-07-17 09:31:32, elapsed: 00:00:02
> CrawlDb update: starting at 2011-07-17 09:31:32
> CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> CrawlDb update: segments: [/home/llist/nutchData/crawl/segments/20110717093124]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2011-07-17 09:31:34, elapsed: 00:00:02
> :
> :
> :
> :
>
> -----------------------------------------------------------------------------------------------
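
A side note on the fetcher warning above: it just means your http.agent.name
value isn't listed as the first entry in http.robots.agents. If you want to
silence it, something along these lines in conf/nutch-site.xml should do (the
agent name here is only an example):

<property>
  <name>http.agent.name</name>
  <value>myNutchCrawler</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>myNutchCrawler,*</value>
</property>

It is only a warning, though, so it isn't what breaks the updatedb step.
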
>
> On Sat, 2011-07-16 at 12:14 +1000, Leo Subscriptions wrote:
>
> > Done, but now get additional errors:
> >
> > -------------------
> > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> > updatedb /home/llist/nutchData/crawl/crawldb
> > -dir /home/llist/nutchData/crawl/segments/20110716105826
> > CrawlDb update: starting at 2011-07-16 11:03:56
> > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > CrawlDb update: segments:
> > [file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch,
> > file:/home/llist/nutchData/crawl/segments/20110716105826/content,
> > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse,
> > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data,
> > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate,
> > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: false
> > CrawlDb update: URL filtering: false
> >  - skipping invalid segment
> > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch
> >  - skipping invalid segment
> > file:/home/llist/nutchData/crawl/segments/20110716105826/content
> >  - skipping invalid segment
> > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse
> >  - skipping invalid segment
> > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data
> >  - skipping invalid segment
> > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate
> >  - skipping invalid segment
> > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: finished at 2011-07-16 11:03:57, elapsed: 00:00:01
> > -------------------------------------------
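
Once updatedb runs against a valid segment, you can sanity-check the result
with the readdb tool; after a successful fetch/parse/update cycle the -stats
output should show a non-zero db_fetched count, e.g.:

/usr/share/nutch/runtime/local/bin/nutch readdb /home/llist/nutchData/crawl/crawldb -stats

If the status counts don't change after an update, the segment data never
made it into the db.
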
> >
> > On Sat, 2011-07-16 at 02:36 +0200, Markus Jelsma wrote:
> >
> > > fetch, then parse.
> > >
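
(In other words: after fetching, run the parse step on the same segment
before updating the db, along the lines of

/usr/share/nutch/runtime/local/bin/nutch parse /home/llist/nutchData/crawl/segments/20110715183244

since, if I recall correctly, the update step expects the segment's
crawl_parse data.)
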
> > > > I'm running Nutch 1.3 on 64-bit Ubuntu; following are the commands and
> > > > relevant output.
> > > >
> > > > ----------------------------------
> > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed
> > > > Injector: starting at 2011-07-15 18:32:10
> > > > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > > > Injector: urlDir: /home/llist/nutchData/seed
> > > > Injector: Converting injected urls to crawl db entries.
> > > > Injector: Merging injected urls into crawl db.
> > > > Injector: finished at 2011-07-15 18:32:13, elapsed: 00:00:02
> > > > =================
> > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > generate /home/llist/nutchData/crawl/crawldb
> > > > /home/llist/nutchData/crawl/segments
> > > > Generator: starting at 2011-07-15 18:32:41
> > > > Generator: Selecting best-scoring urls due for fetch.
> > > > Generator: filtering: true
> > > > Generator: normalizing: true
> > > > Generator: jobtracker is 'local', generating exactly one partition.
> > > > Generator: Partitioning selected urls for politeness.
> > > > Generator: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > > > Generator: finished at 2011-07-15 18:32:45, elapsed: 00:00:03
> > > > ==================
> > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > fetch /home/llist/nutchData/crawl/segments/20110715183244
> > > > Fetcher: Your 'http.agent.name' value should be listed first in
> > > > 'http.robots.agents' property.
> > > > Fetcher: starting at 2011-07-15 18:34:55
> > > > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > > > Fetcher: threads: 10
> > > > QueueFeeder finished: total 1 records + hit by time limit :0
> > > > fetching http://www.seek.com.au/
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=2
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > > -finishing thread FetcherThread, activeThreads=0
> > > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > > -activeThreads=0
> > > > Fetcher: finished at 2011-07-15 18:34:59, elapsed: 00:00:03
> > > > =================
> > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > updatedb /home/llist/nutchData/crawl/crawldb
> > > > -dir /home/llist/nutchData/crawl/segments/20110715183244
> > > > CrawlDb update: starting at 2011-07-15 18:36:00
> > > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > > CrawlDb update: segments:
> > > > [file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch,
> > > > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate,
> > > > file:/home/llist/nutchData/crawl/segments/20110715183244/content]
> > > > CrawlDb update: additions allowed: true
> > > > CrawlDb update: URL normalizing: false
> > > > CrawlDb update: URL filtering: false
> > > > - skipping invalid segment
> > > > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch
> > > > - skipping invalid segment
> > > > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate
> > > > - skipping invalid segment
> > > > file:/home/llist/nutchData/crawl/segments/20110715183244/content
> > > > CrawlDb update: Merging segment data into db.
> > > > CrawlDb update: finished at 2011-07-15 18:36:01, elapsed: 00:00:01
> > > > -----------------------------------
> > > >
> > > > Appreciate any hints on what I'm missing.


-- 
*Lewis*