nutch 1.1 crawl d/n complete issue
Two observations using the nutch 1.1 nightly build nutch-2010-04-14_04-00-47:

1) I was using nutch 1.0 to crawl successfully, but had problems w/ parse-pdf. I decided to try nutch 1.1 w/ parse-tika, which appears to parse all of the 'problem' pdfs that parse-pdf could not handle. The crawldb and segments directories are created and appear to be valid. However, the overall crawl does not finish now:

nutch crawl urls/urls -dir crawl -depth 10
...
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20100415015102]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:133)

Nutch 1.0 would complete like this:

nutch crawl urls/urls -dir crawl -depth 10
...
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=7 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225731
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225644
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225749
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225808
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225713
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225937
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225656
LinkDb: done
Indexer: starting
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Dedup: done
merging indexes to: crawl/index
Adding file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/indexes/part-0
done merging
crawl finished: crawl

Any ideas?
2) If there is a space in any component directory, then $NUTCH_OPTS is invalid and causes this problem:

m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin nutch crawl urls/urls -dir crawl -depth 10 -topN 10
NUTCH_OPTS: -Dhadoop.log.dir=/home/mag/Desktop/untitled folder/nutch-2010-04-14_04-00-47/logs -Dhadoop.log.file=hadoop.log -Djava.library.path=/home/mag/Desktop/untitled folder/nutch-2010-04-14_04-00-47/lib/native/Linux-i386-32
Exception in thread "main" java.lang.NoClassDefFoundError: folder/nutch-2010-04-14_04-00-47/logs
Caused by: java.lang.ClassNotFoundException: folder.nutch-2010-04-14_04-00-47.logs
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
Could not find the main class: folder/nutch-2010-04-14_04-00-47/logs. Program will exit.
m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin

Obviously the workaround is to rename 'untitled folder' to 'untitledFolderWithNoSpaces'.

Thanks, any help would be appreciated w/ issue #1 above.

-m.
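A quick shell sketch of what goes wrong in #2: the NUTCH_OPTS value is presumably expanded unquoted somewhere in the launcher, so the space in the path splits one -D option into two words, and java treats the second word as the main class (hence "folder/nutch-2010-04-14_04-00-47/logs" in the trace). The path below is a made-up stand-in; the general fix in a launcher script is to quote the expansion.

```shell
# Demonstration of the word-splitting behind issue #2.
# NUTCH_HOME here is a hypothetical stand-in for the 'untitled folder' path.
NUTCH_HOME="/home/mag/Desktop/untitled folder/nutch-2010-04-14_04-00-47"
NUTCH_OPTS="-Dhadoop.log.dir=$NUTCH_HOME/logs"

set -- $NUTCH_OPTS              # unquoted expansion: splits at the space
echo "unquoted arg count: $#"   # prints 2 -- java sees 'folder/...' as a class name

set -- "$NUTCH_OPTS"            # quoted expansion: stays one argument
echo "quoted arg count: $#"     # prints 1
```

This matches the NoClassDefFoundError in the trace: everything after the space is handed to java as if it were the class to run.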
Re: nutch 1.1 crawl d/n complete issue
The fix: at line 131 of Crawl.java, generate no longer returns segments like it used to; it now returns segs. Line 131 needs to read:

if (segs == null)

instead of the current:

if (segments == null)

After that change and a recompile, crawl is working just fine.
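One way to apply that one-line change mechanically is with sed. The demo below edits a stand-in file rather than a real checkout; the Crawl.java path and the exact contents of line 131 are assumptions based on the post, so check them against your source tree before running sed -i on it.

```shell
# Stand-in demo of the one-line fix: replace the stale variable name in the
# null check. Verify the real line in your checkout first.
printf 'if (segments == null)\n' > Crawl-line131.txt   # stand-in for line 131
sed -i 's/segments == null/segs == null/' Crawl-line131.txt
cat Crawl-line131.txt    # prints: if (segs == null)
# In a real checkout the file would be src/java/org/apache/nutch/crawl/Crawl.java,
# followed by a rebuild with ant.
```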
Re: nutch 1.1 crawl d/n complete issue
I am new to nutch and still trying to figure out the code flow; however, as a workaround to issue #1, after the crawl finishes you could run the linkdb and index commands separately from cygwin:

$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
$ bin/nutch index crawl/indexes crawl/crawldb/ crawl/linkdb crawl/segments/20100415163946 crawl/segments/20100415164106

This seems to work for me. You may have already tried this workaround, but just in case.

-Harry

On Fri, Apr 16, 2010 at 3:34 AM, matthew a. grisius mgris...@comcast.net wrote:
Re: nutch 1.1 crawl d/n complete issue
Hi Harry,

Yes indeed, it appears to work for me too. Thank you!

nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/crawl/segments/20100415221103
LinkDb: adding segment: file:/crawl/segments/20100415221122
LinkDb: adding segment: file:/crawl/segments/20100415221141
LinkDb: adding segment: file:/crawl/segments/20100415221032
LinkDb: adding segment: file:/crawl/segments/20100415221019
LinkDb: adding segment: file:/crawl/segments/20100415221046
LinkDb: done

nutch index crawl/indexes crawl/crawldb/ crawl/linkdb crawl/segments/20100415221103 crawl/segments/20100415221122 crawl/segments/20100415221141 crawl/segments/20100415221032 crawl/segments/20100415221019 crawl/segments/20100415221046
Indexer: starting
Indexer: done

1) Having verified that Harry's nutch 1.1 workaround can complete the work for my small test crawl: how do I scale the index step above when the data increases 50x and I can no longer fit the segments onto a command line? Maybe this is a non-issue; hopefully this will be fixed before the Nutch 1.1 Release Candidate #1 is voted in.

2) Additionally, I have now lost my ability to peer into the data structures. Both luke 0.9.9.1 and luke 1.0.1 report: "No valid directory at the location, try another location."

Oh well, any suggestions on #1 or #2 would be appreciated. Thanks again!

-m.

On Fri, 2010-04-16 at 08:44 +0800, Harry Nutch wrote:
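On the scaling question (#1), one sketch of a way around typing every timestamped segment is to let the shell glob expand them. The demo/ layout below is made up purely to show the expansion; the real bin/nutch invocation appears only as a comment, so adapt the paths to your own crawl.

```shell
# Sketch: build the segment argument list for the index step with a glob
# rather than listing each timestamped directory by hand.
# The demo/ tree is a hypothetical stand-in for a real crawl directory.
mkdir -p demo/crawl/segments/20100415221019 demo/crawl/segments/20100415221032

# The glob expands to one word per segment directory.
segs=$(ls -d demo/crawl/segments/*)
echo "$segs" | wc -l    # prints 2 (one line per segment)

# With a real crawl the invocation would look like (not run here):
# bin/nutch index crawl/indexes crawl/crawldb/ crawl/linkdb crawl/segments/*
```

This only helps until the expanded argument list itself exceeds the OS limit; past that point the indexing would need to be run in batches of segments.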