Re: Hadoop Disk Error

2010-04-27 Thread Andrzej Bialecki
On 2010-04-26 22:31, Joshua J Pavel wrote:
 
 Sending this out to close the thread if anyone else experiences this
 problem: nutch 1.0 is not AIX-friendly (0.9 is).
 
 I'm not 100% sure which command it may be, but by modifying my path so
 that /opt/freeware/bin has precedence, I no longer get the Hadoop disk
 error.  While I thought this meant the problem came from the nutch script
 rather than the code itself, manually pointing the system calls
 at /opt/freeware/bin didn't fix it.  I assume that until detailed debugging
 is done, further releases will also require a workaround similar to what
 I'm doing.

Ahhh ... now I understand. The problem lies in Hadoop's use of utilities
such as /bin/whoami, /bin/ls and /bin/df. These are used to obtain some
filesystem and permissions information that is otherwise not available
from the JVM.

However, these utilities are expected to produce POSIX-style output on
Unix, or Cygwin-style output on Windows. I guess the native commands in AIX
don't conform to either, so the output of these utilities can't be
parsed, which ultimately results in errors, whereas the output of the
/opt/freeware/bin utilities does follow the POSIX format.

I'm not sure what the difference was in 0.9 that still made it work ...
perhaps the parsing of these outputs was more lenient, or some errors
were ignored.

In any case, we in Nutch can't do anything about this; we can just add
your workaround to the documentation. The problem should be reported to
the Hadoop project.
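
For reference, a minimal sketch of the workaround (untested; the crawl
invocation is only an example, the PATH change is the part that matters):

  # Let the POSIX-style tools in /opt/freeware/bin shadow the native AIX
  # whoami/ls/df before launching Nutch, so Hadoop can parse their output.
  export PATH=/opt/freeware/bin:$PATH
  bin/nutch crawl urls -dir crawl -depth 5 -threads 10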

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Hadoop Disk Error

2010-04-26 Thread Joshua J Pavel

Sending this out to close the thread if anyone else experiences this
problem: nutch 1.0 is not AIX-friendly (0.9 is).

I'm not 100% sure which command it may be, but by modifying my path so
that /opt/freeware/bin has precedence, I no longer get the Hadoop disk
error.  While I thought this meant the problem came from the nutch script
rather than the code itself, manually pointing the system calls
at /opt/freeware/bin didn't fix it.  I assume that until detailed debugging
is done, further releases will also require a workaround similar to what
I'm doing.



From: Joshua J Pavel/Raleigh/i...@ibmus
To: nutch-user@lucene.apache.org
Date: 04/21/2010 01:57 PM
Subject: Re: Hadoop Disk Error





Using 1.1, it looks like the same error at first:
threads = 10
depth = 5
indexer=lucene
Injector: starting
Injector: crawlDb: crawl-20100421175011/crawldb
Injector: urlDir: /projects/events/search/apache-nutch-1.1/cmrolg-even/urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.crawl.Injector.inject(Injector.java:211)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)


But I think the log shows me that I have some build problems... true?

2010-04-21 17:50:14,621 WARN plugin.PluginRepository - Plugins: not a file: url. Can't load plugins from: jar:file:/projects/events/search/apache-nutch-1.1/nutch-1.1.job!/plugins
2010-04-21 17:50:14,623 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2010-04-21 17:50:14,623 INFO plugin.PluginRepository - Registered Plugins:
2010-04-21 17:50:14,623 INFO plugin.PluginRepository - NONE
2010-04-21 17:50:14,623 INFO plugin.PluginRepository - Registered
Extension-Points:
2010-04-21 17:50:14,623 INFO plugin.PluginRepository - NONE
2010-04-21 17:50:14,628 WARN mapred.LocalJobRunner - job_local_0001
java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:591)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 5 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117

RE: Hadoop Disk Error

2010-04-21 Thread Joshua J Pavel

I get the same error on a filesystem with 10 GB (disk space is a commodity
here).  The final crawl, when it succeeds on my Windows machine, is 93 MB, so
I really hope it doesn't need more than 10 GB to even pull down and parse
the first URL.  Is there something concerning threading that could
introduce a job that gets started before the successful completion of a
dependent job?  This is running on the same machine as 0.9 did successfully,
so the only difference is the JDK and the code.

Thanks again for taking a look at this with me.



From: arkadi.kosmy...@csiro.au
To: nutch-user@lucene.apache.org
Date: 04/20/2010 06:30 PM
Subject: RE: Hadoop Disk Error





1 or even 2 GB is far from impressive. Why don't you switch hadoop.tmp.dir
to a place with, say, 50 GB free? Your task may be successful on Windows
just because the temp space limit is different there.


Re: Hadoop Disk Error

2010-04-21 Thread Julien Nioche
Joshua,

Could you try using Nutch 1.1 RC1 (see
http://people.apache.org/~mattmann/apache-nutch-1.1/rc1/)?
Could you also try separating the fetching and parsing steps? E.g. fetch
first as you already do, then parse the fetched segment afterwards (instead
of parsing while re-fetching).
Your crawl is fairly small, so it should not require much space at all.
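
Roughly something like this (a sketch only; the exact segment path will
differ, and it assumes fetcher.parse is set to false as you already have it):

  # one cycle, with parsing done as a separate step after the fetch
  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  SEGMENT=$(ls -d crawl/segments/* | tail -1)   # the segment just generated
  bin/nutch fetch "$SEGMENT"      # fetch only, no parsing
  bin/nutch parse "$SEGMENT"      # parse the fetched segment afterwards
  bin/nutch updatedb crawl/crawldb "$SEGMENT"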

Thanks

Julien

On 21 April 2010 15:28, Joshua J Pavel jpa...@us.ibm.com wrote:

 I get the same error on a filesystem with 10 GB (disk space is a commodity
 here). The final crawl, when it succeeds on my Windows machine, is 93 MB, so I
 really hope it doesn't need more than 10 GB to even pull down and parse the
 first URL. Is there something concerning threading that could introduce a
 job that gets started before the successful completion of a dependent job?
 This is running on the same machine as 0.9 did successfully, so the only
 difference is the JDK and the code.

 Thanks again for taking a look at this with me.



Re: Hadoop Disk Error

2010-04-21 Thread Joshua J Pavel





Joshua,

Could you try using Nutch 1.1 RC1  (see
http://people.apache.org/~mattmann/apache-nutch-1.1/rc1/)?
Could you also try separating the fetching and parsing steps? E.g. fetch
first as you already do, then parse the fetched segment afterwards (instead
of parsing while re-fetching).
Your crawl is fairly small, so it should not require much space at all.

Thanks

Julien


RE: Hadoop Disk Error

2010-04-20 Thread Joshua J Pavel

I am - I changed the location to a filesystem with lots of free space and
watched disk utilization during a crawl.  It'll be a relatively small
crawl, and I have gigs and gigs free.


From: arkadi.kosmy...@csiro.au
To: nutch-user@lucene.apache.org
Date: 04/19/2010 05:53 PM
Subject: RE: Hadoop Disk Error





Are you sure that you have enough space in the temporary directory used by
Hadoop?


RE: Hadoop Disk Error

2010-04-20 Thread Joshua J Pavel

Apologies for filling the thread with troubleshooting.

I tried the same configuration on an identical server, and I still get
exactly the same errors.  I used the same configuration on a Windows system
under Cygwin, and it works successfully.  So now I'm wondering if there is
some incompatibility with my OS or Java?

I'm running nutch 1.0 on AIX 6.1.0.0, with:

java version "1.6.0"
Java(TM) SE Runtime Environment (build pap6460sr5-20090529_04(SR5))
IBM J9 VM (build 2.4, J2RE 1.6.0 IBM J9 2.4 AIX ppc64-64
jvmap6460sr5-20090519_35743 (JIT enabled, AOT enabled)
J9VM - 20090519_035743_BHdSMr
JIT  - r9_20090518_2017
GC   - 20090417_AA)
JCL  - 20090529_01)

It's the same OS as I was using to run Nutch 0.9, but with a different
version of Java.


From: Joshua J Pavel/Raleigh/i...@ibmus
To: nutch-user@lucene.apache.org
Date: 04/20/2010 09:01 AM
Subject: RE: Hadoop Disk Error





I am - I changed the location to a filesystem with lots of free space and
watched disk utilization during a crawl. It'll be a relatively small crawl,
and I have gigs and gigs free.

From: arkadi.kosmy...@csiro.au
To: nutch-user@lucene.apache.org
Date: 04/19/2010 05:53 PM
Subject: RE: Hadoop Disk Error





Are you sure that you have enough space in the temporary directory used by
Hadoop?


Re: Hadoop Disk Error

2010-04-20 Thread Julien Nioche
Hi Joshua,

The error message you got definitely indicates that you are running out of
space.  Have you changed the value of hadoop.tmp.dir in the config file?

J.

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

On 20 April 2010 14:00, Joshua J Pavel jpa...@us.ibm.com wrote:

 I am - I changed the location to a filesystem with lots of free space and
 watched disk utilization during a crawl. It'll be a relatively small crawl,
 and I have gigs and gigs free.


Re: Hadoop Disk Error

2010-04-20 Thread Joshua J Pavel

Yes - how much free space does it need?  We ran 0.9 using /tmp, and that
has ~ 1 GB.  After I first saw this error, I moved it to another filesystem
where I have 2 GB free (maybe not gigs and gigs, but more than I think I
need to complete a small test crawl?).


From: Julien Nioche lists.digitalpeb...@gmail.com
To: nutch-user@lucene.apache.org
Date: 04/20/2010 12:36 PM
Subject: Re: Hadoop Disk Error





Hi Joshua,

The error message you got definitely indicates that you are running out of
space.  Have you changed the value of hadoop.tmp.dir in the config file?

J.

--
DigitalPebble Ltd
http://www.digitalpebble.com



RE: Hadoop Disk Error

2010-04-20 Thread Arkadi.Kosmynin
1 or even 2 GB is far from impressive. Why don't you switch hadoop.tmp.dir to
a place with, say, 50 GB free? Your task may be successful on Windows just
because the temp space limit is different there.
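
Something along these lines, for example (untested; the path is only an
illustration), before re-running the crawl:

  # create a work area on a filesystem with plenty of room ...
  mkdir -p /data/nutch-tmp
  # ... and point Hadoop at it by adding this property to conf/nutch-site.xml
  # (or whichever Hadoop config file you already edited):
  #   <property>
  #     <name>hadoop.tmp.dir</name>
  #     <value>/data/nutch-tmp</value>
  #   </property>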

From: Joshua J Pavel [mailto:jpa...@us.ibm.com]
Sent: Wednesday, 21 April 2010 3:40 AM
To: nutch-user@lucene.apache.org
Subject: Re: Hadoop Disk Error


Yes - how much free space does it need? We ran 0.9 using /tmp, and that has ~ 1 
GB. After I first saw this error, I moved it to another filesystem where I have 
2 GB free (maybe not gigs and gigs, but more than I think I need to complete 
a small test crawl?).


Re: Hadoop Disk Error

2010-04-19 Thread Joshua J Pavel

Some more information, if anyone can help:

If I turn fetcher.parse to false, then it successfully fetches and crawls
the site, and then bombs out with a larger ID for the job:

2010-04-19 20:34:48,342 WARN  mapred.LocalJobRunner - job_local_0010
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
valid local directory for
taskTracker/jobcache/job_local_0010/attempt_local_0010_m_00_0/output/spill0.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:930)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:842)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)

So, it's gotta be a problem with the parsing?  The pages should all be
UTF-8, and I know there are multiple languages involved.  I tried setting
parser.character.encoding.default to match, but it made no difference.  I'd
appreciate any ideas.


From: Joshua J Pavel/Raleigh/i...@ibmus
To: nutch-user@lucene.apache.org
Date: 04/16/2010 03:05 PM
Subject: Re: Hadoop Disk Error





fwiw, the error does seem to be valid: from the taskTracker/jobcache
directory, I only have something for job 1-4.

ls -la
total 0
drwxr-xr-x 6 root system 256 Apr 16 19:01 .
drwxr-xr-x 3 root system 256 Apr 16 19:01 ..
drwxr-xr-x 4 root system 256 Apr 16 19:01 job_local_0001
drwxr-xr-x 4 root system 256 Apr 16 19:01 job_local_0002
drwxr-xr-x 4 root system 256 Apr 16 19:01 job_local_0003
drwxr-xr-x 4 root system 256 Apr 16 19:01 job_local_0004

From: Joshua J Pavel/Raleigh/i...@ibmus
To: nutch-user@lucene.apache.org
Date: 04/16/2010 09:00 AM
Subject: Hadoop Disk Error







We're just now moving from a nutch .9 installation to 1.0, so I'm not
entirely new to this.  However, I can't even get past the first fetch now

RE: Hadoop Disk Error

2010-04-19 Thread Arkadi.Kosmynin
Are you sure that you have enough space in the temporary directory used by 
Hadoop?

From: Joshua J Pavel [mailto:jpa...@us.ibm.com]
Sent: Tuesday, 20 April 2010 6:42 AM
To: nutch-user@lucene.apache.org
Subject: Re: Hadoop Disk Error


Some more information, if anyone can help:

If I turn fetcher.parse to false, then it successfully fetches and crawls the
site, and then bombs out with a larger ID for the job:

2010-04-19 20:34:48,342 WARN mapred.LocalJobRunner - job_local_0010
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0010/attempt_local_0010_m_00_0/output/spill0.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:930)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:842)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)

So, it's gotta be a problem with the parsing? The pages should all be UTF-8, 
and I know there are multiple languages involved. I tried setting 
parser.character.encoding.default to match, but it made no difference. I'd 
appreciate any ideas.




Re: Hadoop Disk Error

2010-04-16 Thread Joshua J Pavel

fwiw, the error does seem to be valid: from the taskTracker/jobcache
directory, I only have something for job 1-4.

ls -la
total 0
drwxr-xr-x  6 root  system  256 Apr 16 19:01 .
drwxr-xr-x  3 root  system  256 Apr 16 19:01 ..
drwxr-xr-x  4 root  system  256 Apr 16 19:01 job_local_0001
drwxr-xr-x  4 root  system  256 Apr 16 19:01 job_local_0002
drwxr-xr-x  4 root  system  256 Apr 16 19:01 job_local_0003
drwxr-xr-x  4 root  system  256 Apr 16 19:01 job_local_0004


From: Joshua J Pavel/Raleigh/i...@ibmus
To: nutch-user@lucene.apache.org
Date: 04/16/2010 09:00 AM
Subject: Hadoop Disk Error







We're just now moving from a nutch .9 installation to 1.0, so I'm not
entirely new to this.  However, I can't even get past the first fetch now,
due to a hadoop error.

Looking in the mailing list archives, this error is normally caused by
either permissions or a full disk.  I overrode the use of /tmp by setting
hadoop.tmp.dir to a place with plenty of space, and I'm running the crawl
as root, yet I'm still getting the error below.

Any thoughts?

Running on AIX with plenty of disk and RAM.

2010-04-16 12:49:51,972 INFO  fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=0
2010-04-16 12:49:52,267 INFO  fetcher.Fetcher - -activeThreads=0,
spinWaiting=0, fetchQueues.totalSize=0
2010-04-16 12:49:52,268 INFO  fetcher.Fetcher - -activeThreads=0,
2010-04-16 12:49:52,270 WARN  mapred.LocalJobRunner - job_local_0005
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
valid local directory for
taskTracker/jobcache/job_local_0005/attempt_local_0005_m_00_0/output/spill0.out

at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:930)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:842)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)