Hadoop Disk Error
We're just now moving from a nutch 0.9 installation to 1.0, so I'm not entirely new to this. However, I can't even get past the first fetch now, due to a hadoop error. Looking in the mailing list archives, this error is normally caused by either permissions or a full disk. I overrode the use of /tmp by setting hadoop.tmp.dir to a place with plenty of space, and I'm running the crawl as root, yet I'm still getting the error below. Any thoughts? Running on AIX with plenty of disk and RAM.

2010-04-16 12:49:51,972 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2010-04-16 12:49:52,267 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2010-04-16 12:49:52,268 INFO fetcher.Fetcher - -activeThreads=0,
2010-04-16 12:49:52,270 WARN mapred.LocalJobRunner - job_local_0005
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0005/attempt_local_0005_m_00_0/output/spill0.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:930)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:842)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
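One quick way to take permissions and mount problems off the table is to point the same basic check Hadoop applies at the directory you configured. Below is a minimal sketch, assuming the Hadoop jar bundled with Nutch is on the classpath; the path is hypothetical and should be replaced with whatever hadoop.tmp.dir resolves to (mapred.local.dir, which the spill files go under, defaults to ${hadoop.tmp.dir}/mapred/local).

    import java.io.File;
    import org.apache.hadoop.util.DiskChecker;
    import org.apache.hadoop.util.DiskChecker.DiskErrorException;

    public class LocalDirCheck {
      public static void main(String[] args) {
        // Hypothetical path -- substitute the directory hadoop.tmp.dir points to.
        File dir = new File("/bigfs/hadoop-tmp/mapred/local");
        try {
          // Creates the directory if missing and verifies it is a readable,
          // writable directory; throws DiskErrorException otherwise.
          DiskChecker.checkDir(dir);
          System.out.println(dir + " looks usable for map output spills");
        } catch (DiskErrorException e) {
          System.err.println("Hadoop would reject this directory: " + e.getMessage());
        }
      }
    }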
Re: Hadoop Disk Error
fwiw, the error does seem to be valid: from the taskTracker/jobcache directory, I only have something for jobs 1-4.

ls -la
total 0
drwxr-xr-x  6 root system 256 Apr 16 19:01 .
drwxr-xr-x  3 root system 256 Apr 16 19:01 ..
drwxr-xr-x  4 root system 256 Apr 16 19:01 job_local_0001
drwxr-xr-x  4 root system 256 Apr 16 19:01 job_local_0002
drwxr-xr-x  4 root system 256 Apr 16 19:01 job_local_0003
drwxr-xr-x  4 root system 256 Apr 16 19:01 job_local_0004
nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com
bin/nutch crawl urls -dir crawl -depth 3 -topN 50

where the urls directory contains urls.txt, which contains:

    http://www.fmforums.com/

and where crawl-urlfilter.txt contains:

    +^http://([a-z0-9]*\.)*fmforums.com/

Note - my nutch setup indexes other sites fine. For example, where the urls directory contains urls.txt containing:

    http://dispatch.neocodesoftware.com

and crawl-urlfilter.txt contains:

    +^http://([a-z0-9]*\.)*dispatch.neocodesoftware.com/

it generates a good crawl... I know I have a known good install, so why does nutch say "No URLs to fetch - check your seed list and URL filters" when trying to index fmforums.com???

Also, fmforums.com/robots.txt looks ok:

    ###
    #
    # sample robots.txt file for this website
    #
    # addresses all robots by using wild card *
    User-agent: *
    #
    # list folders robots are not allowed to index
    #Disallow: /tutorials/404redirect/
    Disallow:
    #
    # list specific files robots are not allowed to index
    #Disallow: /tutorials/custom_error_page.html
    Disallow:
    #
    # list the location of any sitemaps
    Sitemap: http://www.yourdomain.com/site_index.xml
    #
    # End of robots.txt file
    #
    ###
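Not a fix, but a quick way to rule out the regex itself: the standalone sketch below uses plain java.util.regex with find() semantics (similar to what the regex-urlfilter plugin does, but not Nutch's actual filter chain) to apply the include pattern above to the seed URL. The second URL is just a hypothetical inner page for illustration.

    import java.util.regex.Pattern;

    public class UrlFilterRegexCheck {
      public static void main(String[] args) {
        // Include rule from crawl-urlfilter.txt above, with the leading '+' stripped.
        Pattern include = Pattern.compile("^http://([a-z0-9]*\\.)*fmforums.com/");
        String[] urls = {
            "http://www.fmforums.com/",
            "http://fmforums.com/forums/"   // hypothetical inner page
        };
        for (String url : urls) {
          boolean accepted = include.matcher(url).find();
          System.out.println((accepted ? "+ " : "- ") + url);
        }
      }
    }

If both lines print '+', the include pattern itself is not rejecting the seed, and the other rules in crawl-urlfilter.txt (or the contents of the seed file) are the next things to check.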
nutch 1.1 crawl does not complete issue
Two observations using the nutch 1.1 nightly build nutch-2010-04-14_04-00-47:

1) I was using nutch 1.0 to crawl successfully, but had problems with parse-pdf. I decided to try nutch 1.1 with parse-tika, which appears to parse all of the 'problem' pdfs that parse-pdf could not handle. The crawldb and segments directories are created and appear to be valid. However, the overall crawl does not finish now:

nutch crawl urls/urls -dir crawl -depth 10
...
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20100415015102]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:133)

Nutch 1.0 would complete like this:

nutch crawl urls/urls -dir crawl -depth 10
...
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=7 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225731
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225644
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225749
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225808
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225713
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225937
LinkDb: adding segment: file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225656
LinkDb: done
Indexer: starting
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Dedup: done merging indexes to: crawl/index
Adding file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/indexes/part-0
done merging
crawl finished: crawl

Any ideas?

2) If there is a 'space' in any component dir, then $NUTCH_OPTS is invalid and causes this problem:

m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin nutch crawl urls/urls -dir crawl -depth 10 -topN 10
NUTCH_OPTS: -Dhadoop.log.dir=/home/mag/Desktop/untitled folder/nutch-2010-04-14_04-00-47/logs -Dhadoop.log.file=hadoop.log -Djava.library.path=/home/mag/Desktop/untitled folder/nutch-2010-04-14_04-00-47/lib/native/Linux-i386-32
Exception in thread "main" java.lang.NoClassDefFoundError: folder/nutch-2010-04-14_04-00-47/logs
Caused by: java.lang.ClassNotFoundException: folder.nutch-2010-04-14_04-00-47.logs
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
Could not find the main class: folder/nutch-2010-04-14_04-00-47/logs. Program will exit.
m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin

Obviously the workaround is to rename 'untitled folder' to 'untitledFolderWithNoSpaces'.

Thanks, any help would be appreciated with issue #1 above.

-m.
Re: nutch 1.1 crawl does not complete issue
The fix: at line 131 of Crawl.java, generate no longer returns segments like it used to; it now returns segs. Line 131 needs to read

    if (segs == null)

instead of the current

    if (segments == null)

After that change and a recompile, crawl is working just fine.
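For readers applying this by hand, here is a minimal sketch of the corrected check in context. The surrounding lines are approximate (reconstructed from the Nutch 1.1 Crawl class rather than quoted from a specific build); only the segments-to-segs change in the null test is the fix described above.

    // inside the depth loop of org.apache.nutch.crawl.Crawl.main()
    // generate() now returns the paths of the new segments ("segs"),
    // so the stop condition must test that variable, not "segments"
    Path[] segs = generator.generate(crawlDb, segments, -1, topN,
        System.currentTimeMillis());
    if (segs == null) {            // was: if (segments == null)
      LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
      break;
    }

With the old check, segments (the segments directory path) is never null, so the loop did not stop when nothing was generated, and the null segs value presumably caused the NullPointerException at Crawl.java:133 reported above.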
Re: About Apache Nutch 1.1 Final Release
On Sat, 2010-04-10 at 18:22 +0200, Andrzej Bialecki wrote:

> More details on this (your environment, OS, JDK version) and logs/stacktraces
> would be highly appreciated! You mentioned that you have some scripts - if you
> could extract relevant portions from them (or copy the scripts) it would help
> us to ensure that it's not a simple command-line error.

I posted another thread tonight with the fixed code. Can you please commit it for all of us? Thanks.