I am trying to crawl a local folder and index its contents (Word documents,
HTML files, etc.), but no matter how I configure Nutch I get this error:

fetch of file://C:/TestFiles/ failed with:
org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file

Here is the current setup, with notes below each file:

-- nutch-site.xml --
<configuration>
<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>

<property>
 <name>http.robots.agents</name>
 <value>My Nutch Spider</value>
</property>


<property>
  <name>plugin.auto-activation</name>
  <value>true</value>
</property>

<property>
 <name>plugin.includes</name>
 <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>


<property>
 <name>file.content.limit</name>
 <value>-1</value>
</property> 

<property>
  <name>file.crawl.parent</name>
  <value>true</value>
</property>

</configuration>

I have also tried leaving out protocol-http and adding or removing
different extensions in the "parse-" group, as well as every other value I
could find by searching on Google.

I have stripped the file down to the bare minimum and still no change.

-- regex-urlfilter.xml --
+.

I have removed every rule that could filter anything out, so this file
should no longer be blocking any URL.
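
To spell out why I expect "+." to let everything through, here is a minimal sketch of how I understand the regex URL filter to work: rules are read top to bottom, the first pattern found in the URL decides, "+" accepts, "-" rejects, and a URL matching no rule is rejected. This is a simplified Python model and not Nutch's actual code; the function name `accepts` is made up for illustration.

```python
import re


def accepts(rules, url):
    """Simplified model (an assumption, not Nutch code): first matching rule wins."""
    for line in rules:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if re.search(pattern, url):
            return sign == "+"  # "+" accepts, anything else rejects
    return False  # no rule matched, so the URL is ignored


if __name__ == "__main__":
    # The single rule from my filter file, applied to my seed URL.
    print(accepts(["+."], "file:///C:/TestFiles/"))
```

Under that model the one-line filter accepts any non-empty URL, which is why I do not think the filter is what is rejecting the file: URL.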



-- seed.txt --
file:///C:/TestFiles/

I have also tried every variation I could think of: /cygdrive/c, file://,
with and without the trailing /, *.*, and even http://file://, which doesn't
make sense anyway.  Both relative and absolute paths.
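
As a sanity check on the seed forms above, here is a small sketch that shows how a generic URL parser splits a few of them. It uses Python's standard `urllib.parse`, which is only a stand-in: Nutch runs on Java and its URL handling may behave differently, so treat this as an illustration of the two-slash versus three-slash difference and nothing more.

```python
from urllib.parse import urlsplit

# A few of the seed variants I tried (the list is illustrative, not exhaustive).
seeds = [
    "file:///C:/TestFiles/",   # three slashes: empty host, drive letter in the path
    "file://C:/TestFiles/",    # two slashes: "C:" lands in the host position
    "/cygdrive/c/TestFiles/",  # Cygwin path: no scheme at all
]

for seed in seeds:
    parts = urlsplit(seed)
    print(f"{seed!r}: scheme={parts.scheme!r} host={parts.netloc!r} path={parts.path!r}")
```

With this parser, only the first two forms carry a "file" scheme, and the two-slash form puts the drive letter where a host name would go.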


I have looked at every post on Nabble, every blog post I could find (a lot
of which are outdated anyway), and the API docs.  I am even asking in the IRC
channel for help.  If this were fixable by Googling the error, it would have
been fixed already.

There has got to be something about the Cygwin setup that I am missing.
Nutch seems to crawl and fetch websites fine.

Thanks in advance for any help.






