I have been attempting to get the mapred branch version of the crawler
working and have hit some snags.
First, I have observed the same behavior as a previous poster from
yesterday who, instead of specifying a file for the URLs to be read
from, must now specify a directory (full path) to which a file
containing the URL list is stored. From the response to that thread I
am gathering that it isn't desired behavior to specify a directory
instead of a file.
Second, and more importantly, I am having issues with task trackers. I
have three machines running task tracker, and a fourth running the job
tracker, and they seem to be talking well. Whenever I try to invoke
crawl using the job tracker, however, all of my task trackers
continually fail with this:
050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
[Fatal Error] :-1:-1: Premature end of file.
050816 134532 SEVERE error parsing conf file:
org.xml.sax.SAXParseException: Premature end of file.
java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature
end of file.
at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
at
org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
at
org.apache.nutch.mapred.TaskTracker$TaskInProgress.init(TaskTracker.java:319)
at
org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
Caused by: org.xml.sax.SAXParseException: Premature end of file.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
... 8 more
Whenever I look at the job.xml file specified by this location, it
turns out that it is a directory, not a file.
drwxrwxr-x 2 jeremy users 4096 Aug 16 13:45 job.xml
Any help / observation of these issues is most appreciated.
Thanks,
Jeremy