Re: Error with Hadoop-0.4.0

2006-07-12 Thread Sami Siren

Gal Nitzan wrote:


To get the same behavior, just try to inject into a new crawldb that doesn't
exist.

The reason many don't hit it is that the crawldb already exists in their
environment.


True, I was injecting into an existing crawldb.

--
Sami Siren


Re: Error with Hadoop-0.4.0

2006-07-12 Thread Sami Siren

Doug Cutting wrote:


 Jérôme Charron wrote:

 In my environment, the crawl command terminates with the following error:
 2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273)) - Input directory /localpath/crawl/crawldb/current in local is invalid.
 Exception in thread "main" java.io.IOException: Input directory /localpathcrawl/crawldb/current in local is invalid.
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)


 Hadoop 0.4.0 by default requires all input directories to exist,
 where previous releases did not. So we need to either create an
 empty current directory or change the InputFormat used in
 CrawlDb.createJob() to be one that overrides
 InputFormat.areValidInputDirectories(). The former is probably
 easier. I've attached a patch. Does this fix things for folks?



Patch works for me.
--
Sami Siren



Re: Error with Hadoop-0.4.0

2006-07-12 Thread Doug Cutting

Sami Siren wrote:

Patch works for me.


OK.  I just committed it.

Thanks!

Doug


Re: Error with Hadoop-0.4.0

2006-07-10 Thread Andrzej Bialecki

Stefan Groschupf wrote:

We tried your suggested fix:

Injector
by mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir))


I suspect that this is not the right solution - have you actually tested 
that the resulting db contains all entries from the input dirs?


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




RE: Error with Hadoop-0.4.0

2006-07-10 Thread Gal Nitzan
To get the same behavior, just try to inject to a new crawldb that doesn't
exist.

The reason many doesn't get it is that crawldb already exists in their
environment.



-Original Message-
From: Sami Siren [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 06, 2006 7:23 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Error with Hadoop-0.4.0

Jérôme Charron wrote:

 Hi,

 I encountered some problems with Nutch trunk version.
 In fact it seems to be related to the Hadoop-0.4.0 and JDK 1.5 changes
 (more precisely, since HADOOP-129 replaced File with Path).
 Does somebody have the same error?

I am not seeing this (I just ran inject in a single-machine (Linux)
configuration on the local fs, without problems).

--
 Sami Siren




Re: Error with Hadoop-0.4.0

2006-07-10 Thread Doug Cutting

Jérôme Charron wrote:

In my environment, the crawl command terminates with the following error:
2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273)) - Input directory /localpath/crawl/crawldb/current in local is invalid.
Exception in thread "main" java.io.IOException: Input directory /localpathcrawl/crawldb/current in local is invalid.
   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
   at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
   at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)


Hadoop 0.4.0 by default requires all input directories to exist, where 
previous releases did not.  So we need to either create an empty 
current directory or change the InputFormat used in 
CrawlDb.createJob() to be one that overrides 
InputFormat.areValidInputDirectories().  The former is probably easier. 
 I've attached a patch.  Does this fix things for folks?


Doug
Index: src/java/org/apache/nutch/crawl/CrawlDb.java
===
--- src/java/org/apache/nutch/crawl/CrawlDb.java	(revision 417882)
+++ src/java/org/apache/nutch/crawl/CrawlDb.java	(working copy)
@@ -65,7 +65,8 @@
 if (LOG.isInfoEnabled()) { LOG.info("CrawlDb update: done"); }
   }
 
-  public static JobConf createJob(Configuration config, Path crawlDb) {
+  public static JobConf createJob(Configuration config, Path crawlDb)
+throws IOException {
 Path newCrawlDb =
   new Path(crawlDb,
Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
@@ -73,7 +74,11 @@
 JobConf job = new NutchJob(config);
job.setJobName("crawldb " + crawlDb);
 
-job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
+
+Path current = new Path(crawlDb, CrawlDatum.DB_DIR_NAME);
+if (FileSystem.get(job).exists(current)) {
+  job.addInputPath(current);
+}
 job.setInputFormat(SequenceFileInputFormat.class);
 job.setInputKeyClass(UTF8.class);
 job.setInputValueClass(CrawlDatum.class);
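
The shape of the committed fix can be sketched without a Hadoop install: add the crawldb's current directory as a job input only when it actually exists, so that a freshly created (empty) crawldb no longer trips Hadoop 0.4.0's input-directory validation. A minimal local-filesystem sketch, using plain java.nio.file rather than the Hadoop FileSystem API; the class name and the hardcoded "current" (standing in for CrawlDatum.DB_DIR_NAME) are illustrative only:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class CrawlDbInputGuard {

    // Mirrors the guard added to CrawlDb.createJob(): only include the
    // crawldb's "current" directory as a job input if it exists.
    static List<Path> jobInputPaths(Path crawlDb) {
        List<Path> inputs = new ArrayList<>();
        Path current = crawlDb.resolve("current"); // CrawlDatum.DB_DIR_NAME
        if (Files.isDirectory(current)) {          // the existence check from the patch
            inputs.add(current);
        }
        return inputs;
    }

    public static void main(String[] args) throws IOException {
        Path crawlDb = Files.createTempDirectory("crawldb");

        // New crawldb: no "current" directory yet, so nothing is added
        // and submission would no longer fail the validity check.
        System.out.println(jobInputPaths(crawlDb).size()); // prints 0

        // Existing crawldb: "current" is picked up as before.
        Files.createDirectories(crawlDb.resolve("current"));
        System.out.println(jobInputPaths(crawlDb).size()); // prints 1
    }
}
```

With an empty input list the injected urls alone feed the merge job, which is exactly the "inject into a new crawldb" case that triggered the error.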


Re: Error with Hadoop-0.4.0

2006-07-07 Thread Stefan Groschupf

Hi Jérôme,

I have the same problem in a distributed environment! :-(
So I think I can confirm this is a bug.
We should fix that.

Stefan

On 06.07.2006, at 08:54, Jérôme Charron wrote:


Hi,

I encountered some problems with Nutch trunk version.
In fact it seems to be related to the Hadoop-0.4.0 and JDK 1.5 changes
(more precisely, since HADOOP-129 replaced File with Path).

In my environment, the crawl command terminates with the following error:
2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273)) - Input directory /localpath/crawl/crawldb/current in local is invalid.
Exception in thread "main" java.io.IOException: Input directory /localpathcrawl/crawldb/current in local is invalid.
   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
   at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
   at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

By looking at the Nutch code, and simply changing line 145 of Injector
to mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir)),
all is working fine. By taking a closer look at the CrawlDb code, I
finally don't understand why there is the following line in the createJob method:
job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));

Out of curiosity, if a Hadoop guru can explain why there is such a
regression...

Does somebody have the same error?

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/




Re: Error with Hadoop-0.4.0

2006-07-07 Thread Jérôme Charron

I have the same problem in a distributed environment! :-(
So I think I can confirm this is a bug.


Thanks for this feedback Stefan.



We should fix that.


What I suggest is simply to remove line 75 of the createJob method in
CrawlDb:
setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
In fact, this method is only used by Injector.inject() and CrawlDb.update(),
and the input path set in createJob is needed by neither of them.

If no objection, I will commit this change tomorrow.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Error with Hadoop-0.4.0

2006-07-07 Thread Stefan Groschupf

We tried your suggested fix:

Injector
by mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir))


and this worked without any problem.

Thanks for catching that, this saved us a lot of time.
Stefan

On 07.07.2006, at 16:08, Jérôme Charron wrote:


I have the same problem in a distributed environment! :-(
So I think I can confirm this is a bug.


Thanks for this feedback Stefan.



We should fix that.


What I suggest is simply to remove line 75 of the createJob method in
CrawlDb:
setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
In fact, this method is only used by Injector.inject() and CrawlDb.update(),
and the input path set in createJob is needed by neither of them.

If no objection, I will commit this change tomorrow.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/




Re: Error with Hadoop-0.4.0

2006-07-06 Thread Sami Siren

Jérôme Charron wrote:


Hi,

I encountered some problems with Nutch trunk version.
In fact it seems to be related to the Hadoop-0.4.0 and JDK 1.5 changes
(more precisely, since HADOOP-129 replaced File with Path).
Does somebody have the same error?


I am not seeing this (I just ran inject in a single-machine (Linux)
configuration on the local fs, without problems).


--
Sami Siren