Re: IOException in dedup

2009-06-03 Thread Nic M
I used the patch and everything seems to be working fine at the  
moment. Thanks Dogacan.


Nic M

On Jun 3, 2009, at 12:07 PM, Doğacan Güney wrote:


On Tue, Jun 2, 2009 at 20:13, Nic M  wrote:

On Jun 2, 2009, at 12:41 PM, Ken Krugler wrote:


Hello,

I am new to Nutch and I have set up Nutch 0.9 on Easy Eclipse for Mac OS X. When I try to start crawling I get the following exception:

Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)


Does anyone know how to solve this problem?





You may be running into this problem:

https://issues.apache.org/jira/browse/NUTCH-525

I suggest updating to 1.0 or applying the patch there.



You can get an IOException reported by Hadoop when the root cause  
is that you've run out of memory. Normally the hadoop.log file  
would have the OOM exception.


If you're running from inside of Eclipse, see http://wiki.apache.org/nutch/RunNutchInEclipse0.9 for more details.


-- Ken
--
Ken Krugler
+1 530-210-6378


Thank you for the pointers Ken. I changed the VM memory parameters as shown at http://wiki.apache.org/nutch/RunNutchInEclipse0.9. However, I still get the exception, and in the Hadoop log I have the following exception:


2009-06-02 13:08:18,790 INFO  indexer.DeleteDuplicates - Dedup: starting
2009-06-02 13:08:18,817 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/indexes
2009-06-02 13:08:19,064 WARN  mapred.LocalJobRunner - job_7izmuc
java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
    at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
    at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)

I am running Lucene 2.1.0. Any idea why I am getting the ArrayIndexOutOfBoundsException?


Nic






--
Doğacan Güney




Re: IOException in dedup

2009-06-03 Thread Doğacan Güney
On Tue, Jun 2, 2009 at 20:13, Nic M  wrote:

>
> On Jun 2, 2009, at 12:41 PM, Ken Krugler wrote:
>
> Hello,
>
>
> I am new to Nutch and I have set up Nutch 0.9 on Easy Eclipse for Mac OS
> X. When I try to start crawling I get the following exception:
>
>
> Dedup: starting
>
> Dedup: adding indexes in: crawl/indexes
>
> Exception in thread "main" java.io.IOException: Job failed!
>
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>
> at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>
>
>
> Does anyone know how to solve this problem?
>
>
You may be running into this problem:

https://issues.apache.org/jira/browse/NUTCH-525

I suggest updating to 1.0 or applying the patch there.


>
> You can get an IOException reported by Hadoop when the root cause is that
> you've run out of memory. Normally the hadoop.log file would have the OOM
> exception.
>
> If you're running from inside of Eclipse, see
> http://wiki.apache.org/nutch/RunNutchInEclipse0.9 for more details.
>
> -- Ken
>
> --
>
> Ken Krugler
> +1 530-210-6378
>
>
> Thank you for the pointers Ken. I changed the VM memory parameters as shown
> at http://wiki.apache.org/nutch/RunNutchInEclipse0.9. However, I still get
> the exception, and in the Hadoop log I have the following exception:
>
> 2009-06-02 13:08:18,790 INFO  indexer.DeleteDuplicates - Dedup: starting
> 2009-06-02 13:08:18,817 INFO  indexer.DeleteDuplicates - Dedup: adding
> indexes in: crawl/indexes
> 2009-06-02 13:08:19,064 WARN  mapred.LocalJobRunner - job_7izmuc
> java.lang.ArrayIndexOutOfBoundsException: -1
> at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
> at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
> at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
>
> I am running Lucene 2.1.0. Any idea why I am getting the
> ArrayIndexOutOfBoundsException?
>
> Nic
>
>
>
>


-- 
Doğacan Güney


Re: IOException in dedup

2009-06-02 Thread MyD
I had the same problem when I forgot to add the URL field in the  
index. Maybe you have the same problem.
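
One quick way to test this theory is to open one of the part indexes and count documents that lack a stored "url" field. The sketch below is illustrative only: it assumes a default Nutch 0.9 layout under crawl/indexes, a stored field named "url", and the Lucene 2.1-era IndexReader API; the class name is made up.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;

    public class CheckUrlField {
      public static void main(String[] args) throws Exception {
        // Assumed location of one Nutch part index; pass a different path as args[0].
        String indexDir = args.length > 0 ? args[0] : "crawl/indexes/part-00000";
        IndexReader reader = IndexReader.open(indexDir);
        int missing = 0;
        for (int i = 0; i < reader.maxDoc(); i++) {
          if (reader.isDeleted(i)) continue;        // skip deleted document slots
          Document doc = reader.document(i);
          if (doc.get("url") == null) missing++;    // no stored "url" value
        }
        System.out.println(missing + " documents without a stored \"url\" field in " + indexDir);
        reader.close();
      }
    }

If the count comes back non-zero, dedup has nothing to key on for those documents, which would match the symptom described above.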


Regards,
MyD


On Jun 3, 2009, at 1:13 AM, Nic M wrote:



On Jun 2, 2009, at 12:41 PM, Ken Krugler wrote:


Hello,

I am new to Nutch and I have set up Nutch 0.9 on Easy Eclipse for Mac OS X. When I try to start crawling I get the following exception:

Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)


Does anyone know how to solve this problem?


You can get an IOException reported by Hadoop when the root cause  
is that you've run out of memory. Normally the hadoop.log file  
would have the OOM exception.


If you're running from inside of Eclipse, see http://wiki.apache.org/nutch/RunNutchInEclipse0.9 for more details.


-- Ken
--
Ken Krugler
+1 530-210-6378


Thank you for the pointers Ken. I changed the VM memory parameters as shown at http://wiki.apache.org/nutch/RunNutchInEclipse0.9. However, I still get the exception, and in the Hadoop log I have the following exception:


2009-06-02 13:08:18,790 INFO  indexer.DeleteDuplicates - Dedup: starting
2009-06-02 13:08:18,817 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/indexes
2009-06-02 13:08:19,064 WARN  mapred.LocalJobRunner - job_7izmuc
java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
    at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
    at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)

I am running Lucene 2.1.0. Any idea why I am getting the ArrayIndexOutOfBoundsException?


Nic







Re: IOException in dedup

2009-06-02 Thread Ken Krugler

On Jun 2, 2009, at 12:41 PM, Ken Krugler wrote:


Hello,


I am new to Nutch and I have set up Nutch 0.9 on Easy Eclipse for Mac OS X. When I try to start crawling I get the following exception:



Dedup: starting

Dedup: adding indexes in: crawl/indexes

Exception in thread "main" java.io.IOException: Job failed!

    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)



Does anyone know how to solve this problem? 



You can get an IOException reported by Hadoop when the root cause 
is that you've run out of memory. Normally the hadoop.log file 
would have the OOM exception.


If you're running from inside of Eclipse, 
see http://wiki.apache.org/nutch/RunNutchInEclipse0.9 for 
more details.


-- Ken
--
Ken Krugler
+1 530-210-6378



Thank you for the pointers Ken. I changed the VM memory parameters as shown at http://wiki.apache.org/nutch/RunNutchInEclipse0.9. However, I still get the exception, and in the Hadoop log I have the following exception:


2009-06-02 13:08:18,790 INFO  indexer.DeleteDuplicates - Dedup: starting
2009-06-02 13:08:18,817 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/indexes
2009-06-02 13:08:19,064 WARN  mapred.LocalJobRunner - job_7izmuc
java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
    at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
    at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)


I am running Lucene 2.1.0. Any idea why I am getting the ArrayIndexOutOfBoundsException?


Most likely the index has been corrupted. If you can, try opening it using Luke.
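
If Luke is not at hand, a rough programmatic equivalent is to open each part index and print basic stats; a part that fails to open, or reports implausible counts, points at the same kind of corruption. This is only a sketch against the Lucene 2.1-era IndexReader API, assuming the default crawl/indexes layout; the class name is made up.

    import java.io.File;
    import org.apache.lucene.index.IndexReader;

    public class IndexSanityCheck {
      public static void main(String[] args) throws Exception {
        // Assumed parent directory holding the Nutch part indexes.
        File indexesDir = new File(args.length > 0 ? args[0] : "crawl/indexes");
        File[] parts = indexesDir.listFiles();
        if (parts == null) {
          System.out.println("No such directory: " + indexesDir);
          return;
        }
        for (File part : parts) {
          if (!part.isDirectory()) continue;        // ignore stray files
          try {
            IndexReader reader = IndexReader.open(part);
            System.out.println(part + ": maxDoc=" + reader.maxDoc()
                + ", numDocs=" + reader.numDocs());
            reader.close();
          } catch (Exception e) {
            System.out.println(part + ": failed to open - " + e);
          }
        }
      }
    }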


-- Ken
--
Ken Krugler
+1 530-210-6378

Re: IOException in dedup

2009-06-02 Thread Nic M


On Jun 2, 2009, at 12:41 PM, Ken Krugler wrote:


Hello,

I am new to Nutch and I have set up Nutch 0.9 on Easy Eclipse for Mac OS X. When I try to start crawling I get the following exception:


Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)


Does anyone know how to solve this problem?


You can get an IOException reported by Hadoop when the root cause is  
that you've run out of memory. Normally the hadoop.log file would  
have the OOM exception.


If you're running from inside of Eclipse, see http://wiki.apache.org/nutch/RunNutchInEclipse0.9 for more details.


-- Ken
--
Ken Krugler
+1 530-210-6378


Thank you for the pointers Ken. I changed the VM memory parameters as shown at http://wiki.apache.org/nutch/RunNutchInEclipse0.9. However, I still get the exception, and in the Hadoop log I have the following exception:


2009-06-02 13:08:18,790 INFO  indexer.DeleteDuplicates - Dedup: starting
2009-06-02 13:08:18,817 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/indexes
2009-06-02 13:08:19,064 WARN  mapred.LocalJobRunner - job_7izmuc
java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
    at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
    at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)

I am running Lucene 2.1.0. Any idea why I am getting the ArrayIndexOutOfBoundsException?


Nic





Re: IOException in dedup

2009-06-02 Thread Ken Krugler

Hello,

I am new to Nutch and I have set up Nutch 0.9 on Easy Eclipse for Mac OS X. When I try to start crawling I get the following exception:


Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)


Does anyone know how to solve this problem? 


You can get an IOException reported by Hadoop when the root cause is 
that you've run out of memory. Normally the hadoop.log file would 
have the OOM exception.


If you're running from inside of Eclipse, see 
http://wiki.apache.org/nutch/RunNutchInEclipse0.9 for more details.
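
When changing the VM arguments from that wiki page, it can also be worth confirming that the larger heap actually reaches the JVM Nutch runs in, since Eclipse applies -Xmx per run configuration. A minimal check (a sketch only; the class name is made up) is to log the heap ceiling the JVM reports:

    public class HeapCheck {
      public static void main(String[] args) {
        // Prints the -Xmx ceiling actually in effect for this JVM.
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap available to this JVM: " + maxMb + " MB");
      }
    }

The same Runtime.getRuntime().maxMemory() call can be logged from inside the crawl code to rule out a run configuration that silently ignores the new setting.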


-- Ken
--
Ken Krugler
+1 530-210-6378

IOException in dedup

2009-06-02 Thread Nic M

Hello,

I am new to Nutch and I have set up Nutch 0.9 on Easy Eclipse for Mac OS X. When I try to start crawling I get the following exception:


Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)


Does anyone know how to solve this problem?

Nic M