Re: nutch 1.1 crawl d/n complete issue

2010-04-16 Thread Phil Barnett
The Fix.

On line 131 of Crawl.java:

generate() no longer returns segments like it used to; it now returns segs.

Line 131 needs to read

 if (segs == null)

instead of the current

 if (segments == null)

After that change and a recompile, crawl is working just fine.
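
For context, here is roughly what the loop looks like with the fix in place.
This is a paraphrase from memory, so names and arguments outside the segs
check may differ from the actual Crawl.java:

 // generate new segment(s); in 1.1 the result lands in segs, not segments
 for (i = 0; i < depth; i++) {
   Path[] segs = generator.generate(crawlDb, segments, -1, topN,
       System.currentTimeMillis());
   if (segs == null) {  // was: if (segments == null), which never triggers,
                        // so execution fell through to the NPE at line 133
     LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
     break;
   }
   // fetch, parse, and updatedb steps follow
 }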


Re: nutch 1.1 crawl d/n complete issue

2010-04-15 Thread Harry Nutch
I am new to nutch and still trying to figure out the code flow; however, as
a workaround to issue #1, after the crawl finishes you could run the linkdb
and index commands separately from cygwin.

$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments

$ bin/nutch index crawl/indexes crawl/crawldb/ crawl/linkdb \
  crawl/segments/20100415163946 crawl/segments/20100415164106

This seems to work for me. You may have already tried this workaround, but
just in case.

-Harry

On Fri, Apr 16, 2010 at 3:34 AM, matthew a. grisius mgris...@comcast.net wrote:

 Two observations using the nutch 1.1 nightly build
 nutch-2010-04-14_04-00-47:

 1) Previously I was using nutch 1.0 to crawl successfully, but had
 problems w/ parse-pdf. I decided to try nutch 1.1 w/ parse-tika, which
 appears to parse all of the 'problem' pdfs that parse-pdf could not
 handle. The crawldb and segments directories are created and appear to
 be valid. However, the overall crawl does not finish now:

 nutch crawl urls/urls -dir crawl -depth 10
 ...
 Fetcher: done
 CrawlDb update: starting
 CrawlDb update: db: crawl/crawldb
 CrawlDb update: segments: [crawl/segments/20100415015102]
 CrawlDb update: additions allowed: true
 CrawlDb update: URL normalizing: true
 CrawlDb update: URL filtering: true
 CrawlDb update: Merging segment data into db.
 CrawlDb update: done
 Generator: Selecting best-scoring urls due for fetch.
 Generator: starting
 Generator: filtering: true
 Generator: normalizing: true
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: 0 records selected for fetching, exiting ...
 Exception in thread "main" java.lang.NullPointerException
at org.apache.nutch.crawl.Crawl.main(Crawl.java:133)

 Nutch 1.0 would complete like this:

 nutch crawl urls/urls -dir crawl -depth 10
 ...
 Generator: 0 records selected for fetching, exiting ...
 Stopping at depth=7 - no more URLs to fetch.
 LinkDb: starting
 LinkDb: linkdb: crawl/linkdb
 LinkDb: URL normalize: true
 LinkDb: URL filter: true
 LinkDb: adding segment:
 file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225731
 LinkDb: adding segment:
 file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225644
 LinkDb: adding segment:
 file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225749
 LinkDb: adding segment:
 file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225808
 LinkDb: adding segment:
 file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225713
 LinkDb: adding segment:
 file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225937
 LinkDb: adding segment:
 file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225656
 LinkDb: done
 Indexer: starting
 Indexer: done
 Dedup: starting
 Dedup: adding indexes in: crawl/indexes
 Dedup: done
 merging indexes to: crawl/index
 Adding
 file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/indexes/part-0
 done merging
 crawl finished: crawl

 Any ideas?


 2) If there is a 'space' in any component dir, then $NUTCH_OPTS is
 invalid and causes this problem:

 m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin$ nutch
 crawl urls/urls -dir crawl -depth 10 -topN 10
 NUTCH_OPTS:  -Dhadoop.log.dir=/home/mag/Desktop/untitled
 folder/nutch-2010-04-14_04-00-47/logs -Dhadoop.log.file=hadoop.log
 -Djava.library.path=/home/mag/Desktop/untitled
 folder/nutch-2010-04-14_04-00-47/lib/native/Linux-i386-32
 Exception in thread "main" java.lang.NoClassDefFoundError:
 folder/nutch-2010-04-14_04-00-47/logs
 Caused by: java.lang.ClassNotFoundException:
 folder.nutch-2010-04-14_04-00-47.logs
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
 Could not find the main class: folder/nutch-2010-04-14_04-00-47/logs.
 Program will exit.
 m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin$

 Obviously the workaround is to rename 'untitled folder' to
 'untitledFolderWithNoSpaces'.
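
 Presumably the unquoted $NUTCH_OPTS expansion in bin/nutch gets word-split
 at the space, so java sees 'folder/nutch-2010-04-14_04-00-47/logs' as the
 main class. A sketch of one possible script fix, using a bash array; the
 variable names here are my guesses and are not verified against the actual
 bin/nutch:

  # Build the options as an array so each -D flag survives spaces in paths,
  # then expand with "${NUTCH_OPTS[@]}" instead of an unquoted $NUTCH_OPTS.
  NUTCH_OPTS=("-Dhadoop.log.dir=$NUTCH_LOG_DIR"
              "-Dhadoop.log.file=$NUTCH_LOGFILE"
              "-Djava.library.path=$JAVA_LIBRARY_PATH")
  exec "$JAVA" $JAVA_HEAP_MAX "${NUTCH_OPTS[@]}" -classpath "$CLASSPATH" "$CLASS" "$@"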

 Thanks, any help w/b appreciated w/ issue #1 above.

 -m.





Re: nutch 1.1 crawl d/n complete issue

2010-04-15 Thread matthew a. grisius
Hi Harry,

Yes indeed. It appears to work for me too. Thank you!

nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/crawl/segments/20100415221103
LinkDb: adding segment: file:/crawl/segments/20100415221122
LinkDb: adding segment: file:/crawl/segments/20100415221141
LinkDb: adding segment: file:/crawl/segments/20100415221032
LinkDb: adding segment: file:/crawl/segments/20100415221019
LinkDb: adding segment: file:/crawl/segments/20100415221046
LinkDb: done

nutch index crawl/indexes crawl/crawldb/ crawl/linkdb \
  crawl/segments/20100415221103 crawl/segments/20100415221122 \
  crawl/segments/20100415221141 crawl/segments/20100415221032 \
  crawl/segments/20100415221019 crawl/segments/20100415221046
Indexer: starting
Indexer: done

1) After 'verifying' that Harry's nutch 1.1 workaround can complete the
work for my 'small test crawl', how do I scale the 'index step' above
when the data increases 50x and I can no longer fit the segments onto a
command line? Maybe this is a 'non-issue'; hopefully this will be fixed
before the Nutch 1.1 Release Candidate #1 is voted in.
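
In the meantime I suppose I could let the shell expand the segment list,
e.g. (an untested guess on my part, relying on the index command accepting
any number of trailing segment dirs, as above):

nutch index crawl/indexes crawl/crawldb/ crawl/linkdb crawl/segments/*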

2) Additionally, I have now lost my ability to peer into the data
structures. Both Luke 0.9.9.1 and Luke 1.0.1 report:

No valid directory at the location, try another location.

Oh well, any suggestions to #1 or #2 w/b appreciated. Thanks again!

-m.

