Hadoop Disk Error

2010-04-16 Thread Joshua J Pavel


We're just now moving from a Nutch 0.9 installation to 1.0, so I'm not
entirely new to this. However, I can't even get past the first fetch now,
due to a Hadoop error.

Looking in the mailing list archives, this error is normally caused by
either permissions or a full disk. I overrode the use of /tmp by setting
hadoop.tmp.dir to a place with plenty of space, and I'm running the crawl
as root, yet I'm still getting the error below.
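
For reference, a quick sanity check (a sketch, not from the original post; the
path is illustrative) that the override is actually being read and that the
filesystem it points at really has room:

  # confirm the override is defined in the conf directory Nutch is using
  grep -n 'hadoop.tmp.dir' conf/*.xml
  # and that the target filesystem has free space
  df -k /big/fs/nutch-tmp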

Any thoughts?

Running on AIX with plenty of disk and RAM.

2010-04-16 12:49:51,972 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=0
2010-04-16 12:49:52,267 INFO  fetcher.Fetcher - -activeThreads=0,
spinWaiting=0, fetchQueues.totalSize=0
2010-04-16 12:49:52,268 INFO  fetcher.Fetcher - -activeThreads=0,
2010-04-16 12:49:52,270 WARN  mapred.LocalJobRunner - job_local_0005
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
valid local directory for
taskTracker/jobcache/job_local_0005/attempt_local_0005_m_00_0/output/spill0.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:930)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:842)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)

Re: Hadoop Disk Error

2010-04-16 Thread Joshua J Pavel

FWIW, the error does seem to be valid: in the taskTracker/jobcache
directory, I only have something for jobs 1-4.

ls -la
total 0
drwxr-xr-x  6 root system  256 Apr 16 19:01 .
drwxr-xr-x  3 root system  256 Apr 16 19:01 ..
drwxr-xr-x  4 root system  256 Apr 16 19:01 job_local_0001
drwxr-xr-x  4 root system  256 Apr 16 19:01 job_local_0002
drwxr-xr-x  4 root system  256 Apr 16 19:01 job_local_0003
drwxr-xr-x  4 root system  256 Apr 16 19:01 job_local_0004
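
In local mode the spill files land under mapred.local.dir, which by default
resolves to ${hadoop.tmp.dir}/mapred/local, so it may be worth checking that
directory directly (a sketch; the tmp path is illustrative):

  # does the local dir exist, and is it writable by the user running the crawl?
  ls -ld /big/fs/nutch-tmp/mapred/local
  touch /big/fs/nutch-tmp/mapred/local/.write-test && rm /big/fs/nutch-tmp/mapred/local/.write-test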



nutch says "No URLs to fetch - check your seed list and URL filters" when trying to index fmforums.com

2010-04-16 Thread joshuasottpaul

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

where the urls directory contains urls.txt, which contains:
http://www.fmforums.com/

and where crawl-urlfilter.txt contains:
+^http://([a-z0-9]*\.)*fmforums.com/

Note: my Nutch setup indexes other sites fine. For example,

where the urls directory contains urls.txt, which contains:
http://dispatch.neocodesoftware.com

and where crawl-urlfilter.txt contains:
+^http://([a-z0-9]*\.)*dispatch.neocodesoftware.com/

it generates a good crawl.
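
One quick sanity check, not from the original post: strip the leading '+'
include marker that Nutch uses in crawl-urlfilter.txt and test the bare
pattern against the seed with egrep; if nothing is echoed back, the filter is
rejecting the seed (keep in mind Nutch also normalizes URLs before filtering,
so the form that reaches the filter can differ slightly from the raw seed):

  # prints the URL back only if the pattern accepts it
  echo 'http://www.fmforums.com/' | grep -E '^http://([a-z0-9]*\.)*fmforums.com/'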

I know I have a known-good install,

so why does Nutch say "No URLs to fetch - check your seed list and URL
filters" when trying to index fmforums.com?

Also, fmforums.com/robots.txt looks OK:

###
#
# sample robots.txt file for this website 
#
# addresses all robots by using wild card *
User-agent: *
#
# list folders robots are not allowed to index
#Disallow: /tutorials/404redirect/
Disallow:
#
# list specific files robots are not allowed to index
#Disallow: /tutorials/custom_error_page.html
Disallow: 
#
# list the location of any sitemaps
Sitemap: http://www.yourdomain.com/site_index.xml
#
# End of robots.txt file
#
###


nutch 1.1 crawl d/n complete issue

2010-04-16 Thread matthew a. grisius
Two observations using the Nutch 1.1 nightly build
nutch-2010-04-14_04-00-47:

1) I was using Nutch 1.0 to crawl successfully, but had problems with
parse-pdf. I decided to try Nutch 1.1 with parse-tika, which appears to
parse all of the 'problem' PDFs that parse-pdf could not handle. The
crawldb and segments directories are created and appear to be valid.
However, the overall crawl does not finish now:

nutch crawl urls/urls -dir crawl -depth 10
...
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20100415015102]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Exception in thread "main" java.lang.NullPointerException
at org.apache.nutch.crawl.Crawl.main(Crawl.java:133)

Nutch 1.0 would complete like this:

nutch crawl urls/urls -dir crawl -depth 10
...
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=7 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225731
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225644
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225749
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225808
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225713
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225937
LinkDb: adding segment:
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225656
LinkDb: done
Indexer: starting
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Dedup: done
merging indexes to: crawl/index
Adding
file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/indexes/part-0
done merging
crawl finished: crawl

Any ideas?


2) If there is a space in any directory component of the path, $NUTCH_OPTS
ends up malformed and causes this problem:

m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin nutch
crawl urls/urls -dir crawl -depth 10 -topN 10
NUTCH_OPTS:  -Dhadoop.log.dir=/home/mag/Desktop/untitled
folder/nutch-2010-04-14_04-00-47/logs -Dhadoop.log.file=hadoop.log
-Djava.library.path=/home/mag/Desktop/untitled
folder/nutch-2010-04-14_04-00-47/lib/native/Linux-i386-32
Exception in thread "main" java.lang.NoClassDefFoundError: folder/nutch-2010-04-14_04-00-47/logs
Caused by: java.lang.ClassNotFoundException: folder.nutch-2010-04-14_04-00-47.logs
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
Could not find the main class: folder/nutch-2010-04-14_04-00-47/logs.
Program will exit.
m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin 

Obviously the workaround is to rename 'untitled folder' to
'untitledFolderWithNoSpaces'.
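
For what it's worth, a minimal sketch (names are illustrative, not from the
original post) of why an unquoted path with a space breaks the java
invocation: the shell splits it into two arguments, and java then treats the
leftover fragment as the main class name:

  LOG_DIR="/home/mag/Desktop/untitled folder/logs"
  java -Dhadoop.log.dir=$LOG_DIR SomeClass    # splits into two args; java tries to load a class named "folder/logs"
  java -Dhadoop.log.dir="$LOG_DIR" SomeClass  # quoted, so the path stays a single argument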

Thanks, any help would be appreciated with issue #1 above.

-m.



Re: nutch 1.1 crawl d/n complete issue

2010-04-16 Thread Phil Barnett
The fix:

In line 131 of Crawl.java

generate() no longer returns segments like it used to; now it returns segs.

Line 131 needs to read

 if (segs == null)

instead of the current

 if (segments == null)

After that change and a recompile, crawl is working just fine.
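
For anyone applying the fix by hand, a rough sketch of the edit-and-rebuild
cycle, assuming a source checkout built with Ant (the file path and build
target may differ slightly between versions):

  # change the null check around line 131 from "segments" to "segs"
  vi src/java/org/apache/nutch/crawl/Crawl.java
  # rebuild so the crawl command picks up the recompiled class
  ant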


Re: About Apache Nutch 1.1 Final Release

2010-04-16 Thread Phil Barnett
On Sat, 2010-04-10 at 18:22 +0200, Andrzej Bialecki wrote:

> More details on this (your environment, OS, JDK version) and
> logs/stacktraces would be highly appreciated! You mentioned that you
> have some scripts - if you could extract relevant portions from them (or
> copy the scripts) it would help us to ensure that it's not a simple
> command-line error.

I posted another thread tonight with the fixed code.

Can you please commit it for all of us?

Thanks.