Following are the suggested commands and their results. As suggested, I left the redirect setting at 0, since 'crawl' works without any issues. The problem only occurs when running the individual commands.
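For easier reproduction, here is the same sequence condensed into a single script. This is only a sketch of what I am running (paths exactly as in the transcripts below); the one thing I am not sure about is the last step, i.e. whether -dir should point at the segments parent directory or at the individual segment, so I have left it exactly as I ran it:

#!/bin/sh
# Sketch of the individual-command sequence (same paths as the transcripts below).
NUTCH=/usr/share/nutch/runtime/local/bin/nutch
BASE=/home/llist/nutchData

$NUTCH inject $BASE/crawl/crawldb $BASE/seed/urls
$NUTCH generate $BASE/crawl/crawldb $BASE/crawl/segments -topN 100

# the segment name below is the one generate printed; it changes on every run
SEGMENT=$BASE/crawl/segments/20110721122519

$NUTCH fetch $SEGMENT
$NUTCH parse $SEGMENT

# unsure whether -dir should be given the parent 'segments' directory instead of
# the individual segment; this is exactly what the transcript below shows
$NUTCH updatedb $BASE/crawl/crawldb -dir $SEGMENT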
------- nutch-site.xml -------------------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>listers spider</value>
  </property>
  <property>
    <name>fetcher.verbose</name>
    <value>true</value>
    <description>If true, fetcher will log more verbosely.</description>
  </property>
  <property>
    <name>http.verbose</name>
    <value>true</value>
    <description>If true, HTTP will log more verbosely.</description>
  </property>
</configuration>
---------------------------------------------------------------
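Side note: the fetch step below still prints "Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property." If that matters, my understanding (untested, based on the property description shipped in nutch-default.xml) is that an extra override along these lines would satisfy it:

  <property>
    <name>http.robots.agents</name>
    <value>listers spider,*</value>
    <description>Agents checked against robots.txt, comma separated;
    the http.agent.name value first, '*' last.</description>
  </property>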
------ Individual commands and results -------------------------
llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed/urls
Injector: starting at 2011-07-21 12:24:52
Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
Injector: urlDir: /home/llist/nutchData/seed/urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-21 12:24:55, elapsed: 00:00:02

llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments -topN 100
Generator: starting at 2011-07-21 12:25:16
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: /home/llist/nutchData/crawl/segments/20110721122519
Generator: finished at 2011-07-21 12:25:20, elapsed: 00:00:03

llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch fetch /home/llist/nutchData/crawl/segments/20110721122519
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-07-21 12:26:36
Fetcher: segment: /home/llist/nutchData/crawl/segments/20110721122519
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
-finishing thread FetcherThread, activeThreads=1
fetching http://wiki.apache.org/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-07-21 12:26:40, elapsed: 00:00:04

llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch parse /home/llist/nutchData/crawl/segments/20110721122519
ParseSegment: starting at 2011-07-21 12:27:22
ParseSegment: segment: /home/llist/nutchData/crawl/segments/20110721122519
ParseSegment: finished at 2011-07-21 12:27:24, elapsed: 00:00:01

llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110721122519
CrawlDb update: starting at 2011-07-21 12:28:03
CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
CrawlDb update: segments: [file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text, file:/home/llist/nutchData/crawl/segments/20110721122519/content, file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse, file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data, file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch, file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
 - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text
 - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/content
 - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse
 - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data
 - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch
 - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-07-21 12:28:04, elapsed: 00:00:01
------------------------------------------------------------------------------------

On Wed, 2011-07-20 at 21:58 +0100, lewis john mcgibbney wrote:
> There is no documentation for individual commands used to run a Nutch 1.3 crawl so I'm not sure where there has been a mislead. In the instance that this was required I would direct newer users to the legacy documentation for the time being.
>
> My comment to Leo was to understand whether he managed to correct the invalid segments problem.
>
> Leo, if this still persists may I ask you to try again, I will do the same and will be happy to provide feedback
>
> May I suggest the following
>
> use the following commands
>
> inject
> generate
> fetch
> parse
> updatedb
>
> At this stage we should be able to ascertain if something is correct and hopefully debug. May I add the following... please make the following additions to nutch-site.
>
> fetcher verbose - true
> http verbose - true
> check for redirects and set accordingly
>
> On Wed, Jul 20, 2011 at 1:39 PM, Julien Nioche <[email protected]> wrote:
>
> > The wiki can be edited and you are welcome to suggest improvements if there is something missing
> >
> > On 20 July 2011 13:31, Cam Bazz <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I think there is a mislead in the documentation, it does not tell us that we have to parse.
> > >
> > > On Wed, Jul 20, 2011 at 11:42 AM, Julien Nioche <[email protected]> wrote:
> > > > Haven't you forgotten to call parse?
> > > >
> > > > On 19 July 2011 23:40, Leo Subscriptions <[email protected]> wrote:
> > > >
> > > >> Hi Lewis,
> > > >>
> > > >> You are correct about the last post not showing any errors. I just wanted to show that I don't get any errors if I use 'crawl' and to prove that I do not have any faults in the conf files or the directories.
> > > >>
> > > >> I still get the errors if I use the individual commands inject, generate, fetch....
> > > >>
> > > >> Cheers,
> > > >>
> > > >> Leo
> > > >>
> > > >> On Tue, 2011-07-19 at 22:09 +0100, lewis john mcgibbney wrote:
> > > >>
> > > >> > Hi Leo
> > > >> >
> > > >> > Did you resolve?
> > > >> >
> > > >> > Your second log data doesn't appear to show any errors however the problem you specify if one I have witnessed myself while ago. Since you posted have you been able to replicate... or resolve?
> > > >> >
> > > >> > On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions <[email protected]> wrote:
> > > >> >
> > > >> > I've used crawl to ensure config is correct and I don't get any errors, so I must be doing something wrong with the individual steps, but can;t see what.
> > > >> >
> > > >> > --------------------------------------------------------------------------------------------------------------------
> > > >> > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch crawl /home/llist/nutchData/seed/urls -dir /home/llist/nutchData/crawl -depth 3 -topN 5
> > > >> > solrUrl is not set, indexing will be skipped...
> > > >> > crawl started in: /home/llist/nutchData/crawl
> > > >> > rootUrlDir = /home/llist/nutchData/seed/urls
> > > >> > threads = 10
> > > >> > depth = 3
> > > >> > solrUrl=null
> > > >> > topN = 5
> > > >> > Injector: starting at 2011-07-17 09:31:19
> > > >> > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > > >> > Injector: urlDir: /home/llist/nutchData/seed/urls
> > > >> > Injector: Converting injected urls to crawl db entries.
> > > >> > Injector: Merging injected urls into crawl db.
> > > >> > Injector: finished at 2011-07-17 09:31:22, elapsed: 00:00:02
> > > >> > Generator: starting at 2011-07-17 09:31:22
> > > >> > Generator: Selecting best-scoring urls due for fetch.
> > > >> > Generator: filtering: true
> > > >> > Generator: normalizing: true
> > > >> > Generator: topN: 5
> > > >> > Generator: jobtracker is 'local', generating exactly one partition.
> > > >> > Generator: Partitioning selected urls for politeness.
> > > >> > Generator: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > >> > Generator: finished at 2011-07-17 09:31:26, elapsed: 00:00:04
> > > >> > Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> > > >> > Fetcher: starting at 2011-07-17 09:31:26
> > > >> > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > >> > Fetcher: threads: 10
> > > >> > QueueFeeder finished: total 1 records + hit by time limit :0
> > > >> > fetching http://www.seek.com.au/
> > > >> > -finishing thread FetcherThread, activeThreads=1
> > > >> > -finishing thread FetcherThread, activeThreads=1
> > > >> > -finishing thread FetcherThread, activeThreads=1
> > > >> > -finishing thread FetcherThread, activeThreads=1
> > > >> > -finishing thread FetcherThread, activeThreads=1
> > > >> > -finishing thread FetcherThread, activeThreads=1
> > > >> > -finishing thread FetcherThread, activeThreads=1
> > > >> > -finishing thread FetcherThread, activeThreads=1
> > > >> > -finishing thread FetcherThread, activeThreads=1
> > > >> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > >> > -finishing thread FetcherThread, activeThreads=0
> > > >> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > >> > -activeThreads=0
> > > >> > Fetcher: finished at 2011-07-17 09:31:29, elapsed: 00:00:03
> > > >> > ParseSegment: starting at 2011-07-17 09:31:29
> > > >> > ParseSegment: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > >> > ParseSegment: finished at 2011-07-17 09:31:32, elapsed: 00:00:02
> > > >> > CrawlDb update: starting at 2011-07-17 09:31:32
> > > >> > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > >> > CrawlDb update: segments: [/home/llist/nutchData/crawl/segments/20110717093124]
> > > >> > CrawlDb update: additions allowed: true
> > > >> > CrawlDb update: URL normalizing: true
> > > >> > CrawlDb update: URL filtering: true
> > > >> > CrawlDb update: Merging segment data into db.
> > > >> > CrawlDb update: finished at 2011-07-17 09:31:34, elapsed: 00:00:02
> > > >> > :
> > > >> > :
> > > >> > :
> > > >> > :
> > > >> > -----------------------------------------------------------------------------------------------
> > > >> >
> > > >> > On Sat, 2011-07-16 at 12:14 +1000, Leo Subscriptions wrote:
> > > >> >
> > > >> > > Done, but now get additional errors:
> > > >> > >
> > > >> > > -------------------
> > > >> > > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110716105826
> > > >> > > CrawlDb update: starting at 2011-07-16 11:03:56
> > > >> > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > >> > > CrawlDb update: segments: [file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch, file:/home/llist/nutchData/crawl/segments/20110716105826/content, file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse, file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data, file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate, file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text]
> > > >> > > CrawlDb update: additions allowed: true
> > > >> > > CrawlDb update: URL normalizing: false
> > > >> > > CrawlDb update: URL filtering: false
> > > >> > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch
> > > >> > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/content
> > > >> > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse
> > > >> > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data
> > > >> > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate
> > > >> > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text
> > > >> > > CrawlDb update: Merging segment data into db.
> > > >> > > CrawlDb update: finished at 2011-07-16 11:03:57, elapsed: 00:00:01
> > > >> > > -------------------------------------------
> > > >> > >
> > > >> > > On Sat, 2011-07-16 at 02:36 +0200, Markus Jelsma wrote:
> > > >> > >
> > > >> > > > fetch, then parse.
> > > >> > > >
> > > >> > > > > I'm running nutch 1.3 on 64 bit Ubuntu, following are the commands and relevant output.
> > > >> > > > >
> > > >> > > > > ----------------------------------
> > > >> > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed
> > > >> > > > > Injector: starting at 2011-07-15 18:32:10
> > > >> > > > > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > > >> > > > > Injector: urlDir: /home/llist/nutchData/seed
> > > >> > > > > Injector: Converting injected urls to crawl db entries.
> > > >> > > > > Injector: Merging injected urls into crawl db.
> > > >> > > > > Injector: finished at 2011-07-15 18:32:13, elapsed: 00:00:02
> > > >> > > > > =================
> > > >> > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments
> > > >> > > > > Generator: starting at 2011-07-15 18:32:41
> > > >> > > > > Generator: Selecting best-scoring urls due for fetch.
> > > >> > > > > Generator: filtering: true
> > > >> > > > > Generator: normalizing: true
> > > >> > > > > Generator: jobtracker is 'local', generating exactly one partition.
> > > >> > > > > Generator: Partitioning selected urls for politeness.
> > > >> > > > > Generator: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > > >> > > > > Generator: finished at 2011-07-15 18:32:45, elapsed: 00:00:03
> > > >> > > > > ==================
> > > >> > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch fetch /home/llist/nutchData/crawl/segments/20110715183244
> > > >> > > > > Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> > > >> > > > > Fetcher: starting at 2011-07-15 18:34:55
> > > >> > > > > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > > >> > > > > Fetcher: threads: 10
> > > >> > > > > QueueFeeder finished: total 1 records + hit by time limit :0
> > > >> > > > > fetching http://www.seek.com.au/
> > > >> > > > > -finishing thread FetcherThread, activeThreads=1
> > > >> > > > > -finishing thread FetcherThread, activeThreads=1
> > > >> > > > > -finishing thread FetcherThread, activeThreads=1
> > > >> > > > > -finishing thread FetcherThread, activeThreads=1
> > > >> > > > > -finishing thread FetcherThread, activeThreads=2
> > > >> > > > > -finishing thread FetcherThread, activeThreads=1
> > > >> > > > > -finishing thread FetcherThread, activeThreads=1
> > > >> > > > > -finishing thread FetcherThread, activeThreads=1
> > > >> > > > > -finishing thread FetcherThread, activeThreads=1
> > > >> > > > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > >> > > > > -finishing thread FetcherThread, activeThreads=0
> > > >> > > > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > >> > > > > -activeThreads=0
> > > >> > > > > Fetcher: finished at 2011-07-15 18:34:59, elapsed: 00:00:03
> > > >> > > > > =================
> > > >> > > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110715183244
> > > >> > > > > CrawlDb update: starting at 2011-07-15 18:36:00
> > > >> > > > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > >> > > > > CrawlDb update: segments: [file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch, file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate, file:/home/llist/nutchData/crawl/segments/20110715183244/content]
> > > >> > > > > CrawlDb update: additions allowed: true
> > > >> > > > > CrawlDb update: URL normalizing: false
> > > >> > > > > CrawlDb update: URL filtering: false
> > > >> > > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch
> > > >> > > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate
> > > >> > > > > - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/content
> > > >> > > > > CrawlDb update: Merging segment data into db.
> > > >> > > > > CrawlDb update: finished at 2011-07-15 18:36:01, elapsed: 00:00:01
> > > >> > > > > -----------------------------------
> > > >> > > > >
> > > >> > > > > Appreciate any hints on what I'm missing.
> > > >> >
> > > >> > --
> > > >> > Lewis
> > > >>
> > > >
> > > > --
> > > > *
> > > > *Open Source Solutions for Text Engineering
> > > >
> > > > http://digitalpebble.blogspot.com/
> > > > http://www.digitalpebble.com
> >
> > --
> > *
> > *Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
>

