Hello, I think the documentation is misleading: it does not tell us that we have to run parse.
On Wed, Jul 20, 2011 at 11:42 AM, Julien Nioche <[email protected]> wrote:

> Haven't you forgotten to call parse?
>
> On 19 July 2011 23:40, Leo Subscriptions <[email protected]> wrote:
>
> > Hi Lewis,
> >
> > You are correct about the last post not showing any errors. I just
> > wanted to show that I don't get any errors if I use 'crawl' and to prove
> > that I do not have any faults in the conf files or the directories.
> >
> > I still get the errors if I use the individual commands inject,
> > generate, fetch....
> >
> > Cheers,
> >
> > Leo
> >
> > On Tue, 2011-07-19 at 22:09 +0100, lewis john mcgibbney wrote:
> >
> > > Hi Leo,
> > >
> > > Did you resolve this?
> > >
> > > Your second set of log data doesn't appear to show any errors; however,
> > > the problem you specify is one I have witnessed myself a while ago.
> > > Since you posted, have you been able to replicate it... or resolve it?
> > >
> > > On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions
> > > <[email protected]> wrote:
> > >
> > > > I've used crawl to ensure the config is correct and I don't get any
> > > > errors, so I must be doing something wrong with the individual steps,
> > > > but can't see what.
> > > >
> > > > --------------------------------------------------------------------
> > > > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> > > > crawl /home/llist/nutchData/seed/urls -dir /home/llist/nutchData/crawl
> > > > -depth 3 -topN 5
> > > > solrUrl is not set, indexing will be skipped...
> > > > crawl started in: /home/llist/nutchData/crawl
> > > > rootUrlDir = /home/llist/nutchData/seed/urls
> > > > threads = 10
> > > > depth = 3
> > > > solrUrl=null
> > > > topN = 5
> > > > Injector: starting at 2011-07-17 09:31:19
> > > > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > > > Injector: urlDir: /home/llist/nutchData/seed/urls
> > > > Injector: Converting injected urls to crawl db entries.
> > > > Injector: Merging injected urls into crawl db.
> > > > Injector: finished at 2011-07-17 09:31:22, elapsed: 00:00:02
> > > > Generator: starting at 2011-07-17 09:31:22
> > > > Generator: Selecting best-scoring urls due for fetch.
> > > > Generator: filtering: true
> > > > Generator: normalizing: true
> > > > Generator: topN: 5
> > > > Generator: jobtracker is 'local', generating exactly one partition.
> > > > Generator: Partitioning selected urls for politeness.
> > > > Generator: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > > Generator: finished at 2011-07-17 09:31:26, elapsed: 00:00:04
> > > > Fetcher: Your 'http.agent.name' value should be listed first in
> > > > 'http.robots.agents' property.
> > > > Fetcher: starting at 2011-07-17 09:31:26
> > > > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > > Fetcher: threads: 10
> > > > QueueFeeder finished: total 1 records + hit by time limit :0
> > > > fetching http://www.seek.com.au/
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > > -finishing thread FetcherThread, activeThreads=0
> > > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > > -activeThreads=0
> > > > Fetcher: finished at 2011-07-17 09:31:29, elapsed: 00:00:03
> > > > ParseSegment: starting at 2011-07-17 09:31:29
> > > > ParseSegment: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > > ParseSegment: finished at 2011-07-17 09:31:32, elapsed: 00:00:02
> > > > CrawlDb update: starting at 2011-07-17 09:31:32
> > > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > > CrawlDb update: segments: [/home/llist/nutchData/crawl/segments/20110717093124]
> > > > CrawlDb update: additions allowed: true
> > > > CrawlDb update: URL normalizing: true
> > > > CrawlDb update: URL filtering: true
> > > > CrawlDb update: Merging segment data into db.
> > > > CrawlDb update: finished at 2011-07-17 09:31:34, elapsed: 00:00:02
> > > > :
> > > > :
> > > > --------------------------------------------------------------------
> > > >
> > > > On Sat, 2011-07-16 at 12:14 +1000, Leo Subscriptions wrote:
> > > >
> > > > > Done, but now I get additional errors:
> > > > >
> > > > > -------------------
> > > > > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> > > > > updatedb /home/llist/nutchData/crawl/crawldb
> > > > > -dir /home/llist/nutchData/crawl/segments/20110716105826
> > > > > CrawlDb update: starting at 2011-07-16 11:03:56
> > > > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > > > CrawlDb update: segments:
> > > > > [file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch,
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/content,
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse,
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data,
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate,
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text]
> > > > > CrawlDb update: additions allowed: true
> > > > > CrawlDb update: URL normalizing: false
> > > > > CrawlDb update: URL filtering: false
> > > > > - skipping invalid segment
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch
> > > > > - skipping invalid segment
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/content
> > > > > - skipping invalid segment
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse
> > > > > - skipping invalid segment
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data
> > > > > - skipping invalid segment
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate
> > > > > - skipping invalid segment
> > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text
> > > > > CrawlDb update: Merging segment data into db.
> > > > > CrawlDb update: finished at 2011-07-16 11:03:57, elapsed: 00:00:01
> > > > > -------------------------------------------
> > > > >
> > > > > On Sat, 2011-07-16 at 02:36 +0200, Markus Jelsma wrote:
> > > > >
> > > > > > fetch, then parse.
> > > > > >
> > > > > > > I'm running nutch 1.3 on 64-bit Ubuntu; following are the
> > > > > > > commands and relevant output.
> > > > > > >
> > > > > > > ----------------------------------
> > > > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > > > > inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed
> > > > > > > Injector: starting at 2011-07-15 18:32:10
> > > > > > > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > > > > > > Injector: urlDir: /home/llist/nutchData/seed
> > > > > > > Injector: Converting injected urls to crawl db entries.
> > > > > > > Injector: Merging injected urls into crawl db.
> > > > > > > Injector: finished at 2011-07-15 18:32:13, elapsed: 00:00:02
> > > > > > > =================
> > > > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > > > > generate /home/llist/nutchData/crawl/crawldb
> > > > > > > /home/llist/nutchData/crawl/segments
> > > > > > > Generator: starting at 2011-07-15 18:32:41
> > > > > > > Generator: Selecting best-scoring urls due for fetch.
> > > > > > > Generator: filtering: true
> > > > > > > Generator: normalizing: true
> > > > > > > Generator: jobtracker is 'local', generating exactly one partition.
> > > > > > > Generator: Partitioning selected urls for politeness.
> > > > > > > Generator: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > > > > > > Generator: finished at 2011-07-15 18:32:45, elapsed: 00:00:03
> > > > > > > ==================
> > > > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > > > > fetch /home/llist/nutchData/crawl/segments/20110715183244
> > > > > > > Fetcher: Your 'http.agent.name' value should be listed first in
> > > > > > > 'http.robots.agents' property.
> > > > > > > Fetcher: starting at 2011-07-15 18:34:55
> > > > > > > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > > > > > > Fetcher: threads: 10
> > > > > > > QueueFeeder finished: total 1 records + hit by time limit :0
> > > > > > > fetching http://www.seek.com.au/
> > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > -finishing thread FetcherThread, activeThreads=2
> > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > > > > > -finishing thread FetcherThread, activeThreads=0
> > > > > > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > > > > > -activeThreads=0
> > > > > > > Fetcher: finished at 2011-07-15 18:34:59, elapsed: 00:00:03
> > > > > > > =================
> > > > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > > > > updatedb /home/llist/nutchData/crawl/crawldb
> > > > > > > -dir /home/llist/nutchData/crawl/segments/20110715183244
> > > > > > > CrawlDb update: starting at 2011-07-15 18:36:00
> > > > > > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > > > > > CrawlDb update: segments:
> > > > > > > [file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch,
> > > > > > > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate,
> > > > > > > file:/home/llist/nutchData/crawl/segments/20110715183244/content]
> > > > > > > CrawlDb update: additions allowed: true
> > > > > > > CrawlDb update: URL normalizing: false
> > > > > > > CrawlDb update: URL filtering: false
> > > > > > > - skipping invalid segment
> > > > > > > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch
> > > > > > > - skipping invalid segment
> > > > > > > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate
> > > > > > > - skipping invalid segment
> > > > > > > file:/home/llist/nutchData/crawl/segments/20110715183244/content
> > > > > > > CrawlDb update: Merging segment data into db.
> > > > > > > CrawlDb update: finished at 2011-07-15 18:36:01, elapsed: 00:00:01
> > > > > > > -----------------------------------
> > > > > > >
> > > > > > > Appreciate any hints on what I'm missing.
> > >
> > > --
> > > Lewis
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
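For anyone hitting the same "skipping invalid segment" messages: two things appear to be going on in the logs above. First, `parse` was never run between `fetch` and `updatedb`, so the segment lacks the parse output (`crawl_parse`, `parse_data`, `parse_text`) that `updatedb` expects. Second, `updatedb` was given `-dir <one segment>`, which makes Nutch treat each of that segment's subdirectories (`crawl_fetch`, `content`, ...) as a segment and reject them. A sketch of the step-by-step cycle for Nutch 1.3 (local runtime), using the paths from this thread; the `-topN 5` value and the newest-segment selection line are illustrative assumptions, not something prescribed by the thread:

```shell
# Hypothetical paths, taken from the thread above -- adjust to your layout.
NUTCH=/usr/share/nutch/runtime/local/bin/nutch
CRAWLDB=/home/llist/nutchData/crawl/crawldb
SEEDS=/home/llist/nutchData/seed/urls
SEGMENTS=/home/llist/nutchData/crawl/segments

$NUTCH inject   "$CRAWLDB" "$SEEDS"
$NUTCH generate "$CRAWLDB" "$SEGMENTS" -topN 5

# Pick up the segment the generator just created (newest timestamped dir).
SEGMENT=$SEGMENTS/$(ls "$SEGMENTS" | sort | tail -1)

$NUTCH fetch    "$SEGMENT"
$NUTCH parse    "$SEGMENT"              # the missing step
$NUTCH updatedb "$CRAWLDB" "$SEGMENT"   # pass the segment itself, not -dir <segment>
```

The `-dir` option is for pointing `updatedb` at the parent `segments` directory so it merges every segment under it; a single segment is passed as a plain argument, as in the successful `crawl` log above.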

