Hello,

I am trying to switch my setup over to HDFS, and I am getting an error when
injecting.

----- Inject (Step 1 of 12) -----
/opt/nutch/bin/nutch inject hdfs://localhost:9000/cluster/nutchnew/crawl/crawldb hdfs://localhost:9000/urls/urls
/opt/nutch:/opt/nutch/conf:/usr/java/latest/lib/tools.jar:/opt/nutch/build/plugins:/opt/nutch/build/nutch-*.job:/opt/nutch/nutch-1.2.job:/opt/nutch/lib/apache-solr-core-1.4.0.jar:/opt/nutch/lib/apache-solr-solrj-1.4.0.jar:/opt/nutch/lib/commons-beanutils-1.8.0.jar:/opt/nutch/lib/commons-cli-1.2.jar:/opt/nutch/lib/commons-codec-1.3.jar:/opt/nutch/lib/commons-collections-3.2.1.jar:/opt/nutch/lib/commons-el-1.0.jar:/opt/nutch/lib/commons-httpclient-3.1.jar:/opt/nutch/lib/commons-io-1.4.jar:/opt/nutch/lib/commons-lang-2.1.jar:/opt/nutch/lib/commons-logging-1.0.4.jar:/opt/nutch/lib/commons-logging-api-1.0.4.jar:/opt/nutch/lib/commons-net-1.4.1.jar:/opt/nutch/lib/core-3.1.1.jar:/opt/nutch/lib/geronimo-stax-api_1.0_spec-1.0.1.jar:/opt/nutch/lib/hadoop-0.20.2-core.jar:/opt/nutch/lib/hadoop-0.20.2-tools.jar:/opt/nutch/lib/hsqldb-1.8.0.10.jar:/opt/nutch/lib/icu4j-4_0_1.jar:/opt/nutch/lib/jakarta-oro-2.0.8.jar:/opt/nutch/lib/jasper-compiler-5.5.12.jar:/opt/nutch/lib/jasper-runtime-5.5.12.jar:/opt/nutch/lib/jcl-over-slf4j-1.5.5.jar:/opt/nutch/lib/jets3t-0.6.1.jar:/opt/nutch/lib/jetty-6.1.14.jar:/opt/nutch/lib/jetty-util-6.1.14.jar:/opt/nutch/lib/junit-3.8.1.jar:/opt/nutch/lib/kfs-0.2.2.jar:/opt/nutch/lib/log4j-1.2.15.jar:/opt/nutch/lib/lucene-core-2.9.3.jar:/opt/nutch/lib/lucene-misc-2.9.3.jar:/opt/nutch/lib/oro-2.0.8.jar:/opt/nutch/lib/resolver.jar:/opt/nutch/lib/serializer.jar:/opt/nutch/lib/servlet-api-2.5-6.1.14.jar:/opt/nutch/lib/slf4j-api-1.5.5.jar:/opt/nutch/lib/slf4j-log4j12-1.4.3.jar:/opt/nutch/lib/taglibs-i18n.jar:/opt/nutch/lib/tika-core-0.7.jar:/opt/nutch/lib/wstx-asl-3.2.7.jar:/opt/nutch/lib/xercesImpl.jar:/opt/nutch/lib/xml-apis.jar:/opt/nutch/lib/xmlenc-0.52.jar:/opt/nutch/lib/jsp-2.1/jsp-2.1.jar:/opt/nutch/lib/jsp-2.1/jsp-api-2.1.jar
Injector: starting at 2010-12-08 13:02:42
Injector: crawlDb: inject
Injector: urlDir: hdfs://localhost:9000/cluster/nutchnew/crawl/crawldb
Injector: Converting injected urls to crawl db entries.
Injector: java.io.IOException: Not a file: hdfs://localhost:9000/cluster/nutchnew/crawl/crawldb/current
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:206)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
    at org.apache.nutch.crawl.Injector.run(Injector.java:248)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:238)


I don't understand the "Not a file" error. There is normally a current
directory under crawldb. Here is my usual directory structure on the local
filesystem:

[glassf...@search2 nutch]$ ls -l /search/database/nutchnew/crawl/crawldb/
total 8
drwxr-sr-x 4 glassfish gfadmin 1024 Dec  7 23:40 current
[glassf...@search2 nutch]$ ls -l /search/database/nutchnew/crawl/crawldb/current/
total 16
drwxr-sr-x 2 glassfish gfadmin 1024 Dec  7 23:40 part-00000
drwxr-sr-x 2 glassfish gfadmin 1024 Dec  7 23:40 part-00001


Why would it make a difference that the crawldb is on HDFS?

The command I normally run is:

/opt/nutch/bin/nutch inject /search/database/nutchnew/crawl/crawldb/ /opt/nutch/urls/urls

How is this any different?

Also, I just ran it manually instead of as part of a script, and this time
it ran fine.

[glassf...@search2 nutch]$ /opt/nutch/bin/nutch inject hdfs://localhost:9000/cluster/nutchnew/crawl/crawldb hdfs://localhost:9000/urls/urls
/opt/nutch:/opt/nutch/conf:/usr/java/latest/lib/tools.jar:/opt/nutch/build/plugins:/opt/nutch/build/nutch-*.job:/opt/nutch/nutch-1.2.job:/opt/nutch/lib/apache-solr-core-1.4.0.jar:/opt/nutch/lib/apache-solr-solrj-1.4.0.jar:/opt/nutch/lib/commons-beanutils-1.8.0.jar:/opt/nutch/lib/commons-cli-1.2.jar:/opt/nutch/lib/commons-codec-1.3.jar:/opt/nutch/lib/commons-collections-3.2.1.jar:/opt/nutch/lib/commons-el-1.0.jar:/opt/nutch/lib/commons-httpclient-3.1.jar:/opt/nutch/lib/commons-io-1.4.jar:/opt/nutch/lib/commons-lang-2.1.jar:/opt/nutch/lib/commons-logging-1.0.4.jar:/opt/nutch/lib/commons-logging-api-1.0.4.jar:/opt/nutch/lib/commons-net-1.4.1.jar:/opt/nutch/lib/core-3.1.1.jar:/opt/nutch/lib/geronimo-stax-api_1.0_spec-1.0.1.jar:/opt/nutch/lib/hadoop-0.20.2-core.jar:/opt/nutch/lib/hadoop-0.20.2-tools.jar:/opt/nutch/lib/hsqldb-1.8.0.10.jar:/opt/nutch/lib/icu4j-4_0_1.jar:/opt/nutch/lib/jakarta-oro-2.0.8.jar:/opt/nutch/lib/jasper-compiler-5.5.12.jar:/opt/nutch/lib/jasper-runtime-5.5.12.jar:/opt/nutch/lib/jcl-over-slf4j-1.5.5.jar:/opt/nutch/lib/jets3t-0.6.1.jar:/opt/nutch/lib/jetty-6.1.14.jar:/opt/nutch/lib/jetty-util-6.1.14.jar:/opt/nutch/lib/junit-3.8.1.jar:/opt/nutch/lib/kfs-0.2.2.jar:/opt/nutch/lib/log4j-1.2.15.jar:/opt/nutch/lib/lucene-core-2.9.3.jar:/opt/nutch/lib/lucene-misc-2.9.3.jar:/opt/nutch/lib/oro-2.0.8.jar:/opt/nutch/lib/resolver.jar:/opt/nutch/lib/serializer.jar:/opt/nutch/lib/servlet-api-2.5-6.1.14.jar:/opt/nutch/lib/slf4j-api-1.5.5.jar:/opt/nutch/lib/slf4j-log4j12-1.4.3.jar:/opt/nutch/lib/taglibs-i18n.jar:/opt/nutch/lib/tika-core-0.7.jar:/opt/nutch/lib/wstx-asl-3.2.7.jar:/opt/nutch/lib/xercesImpl.jar:/opt/nutch/lib/xml-apis.jar:/opt/nutch/lib/xmlenc-0.52.jar:/opt/nutch/lib/jsp-2.1/jsp-2.1.jar:/opt/nutch/lib/jsp-2.1/jsp-api-2.1.jar
Injector: starting at 2010-12-08 13:32:42
Injector: crawlDb: hdfs://localhost:9000/cluster/nutchnew/crawl/crawldb
Injector: urlDir: hdfs://localhost:9000/urls/urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2010-12-08 13:33:36, elapsed: 00:00:54
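
Comparing the two runs, the failed one reports crawlDb as "inject" and
urlDir as the crawldb path, while the manual one reports the crawldb path
and the urls path as expected. So the arguments look shifted by one when the
script makes the call. Below is a minimal sketch for checking what a script
really passes. The show_args helper and the CRAWLDB and URLDIR variable names
are hypothetical and are not taken from my actual script.

```shell
#!/bin/sh
# Hypothetical helper: print every argument on its own numbered line,
# so an extra or shifted argument is easy to spot.
show_args() {
  i=0
  for arg in "$@"; do
    printf 'arg[%d]=%s\n' "$i" "$arg"
    i=$((i + 1))
  done
}

# Hypothetical variable names; a real script may build the command differently.
CRAWLDB=hdfs://localhost:9000/cluster/nutchnew/crawl/crawldb
URLDIR=hdfs://localhost:9000/urls/urls

# Print the exact argument list instead of running the command.
show_args /opt/nutch/bin/nutch inject "$CRAWLDB" "$URLDIR"
```

If the script's own list shows "inject" twice, or a variable that already
contains the subcommand, that would account for the difference between the
scripted and the manual run.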

That's weird.

Any thoughts?

Steve
