Hello

I'm getting the following exceptions while indexing my site, which is hosted
on my local machine.


Skipping -^(file|ftp|mailto)::java.net.MalformedURLException: no protocol: -^(file|ftp|mailto):
Skipping -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|css|CSS|wmv|WMV)$:java.net.MalformedURLException: no protocol: -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|css|CSS|wmv|WMV)$
Skipping +^http://women.net/*:java.net.MalformedURLException: no protocol: +^http://women.net/*
Skipping -.:java.net.MalformedURLException: no protocol: -.
Skipping -^(file|ftp|mailto)::java.net.MalformedURLException: no protocol: -^(file|ftp|mailto):
Skipping -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|swf|SWF|ashx|css|CSS|wmv|WMV)$:java.net.MalformedURLException: no protocol: -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|swf|SWF|ashx|css|CSS|wmv|WMV)$
Skipping +^http://women.net/*:java.net.MalformedURLException: no protocol: +^http://women.net/*

LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Nutch/local/crawl/segments/20120905010233/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)


I've added the following line in conf\regex-urlfilter.txt:

#+^http://([a-z0-9]*\.)*women.com/
+http://women.net/*

and in conf\crawl-urlfilter.txt:
# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+http://women.net/*
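
For reference, here is how I understand these filter lines to be evaluated. This is only a rough Python sketch, not Nutch's actual Java code. It assumes that each line is a sign (+ to accept, - to reject) followed by a regular expression, that the first rule whose pattern is found anywhere in the URL decides, and that a URL matching no rule is dropped. The rule order is taken from the "Skipping" lines in the log above; the suffix list is shortened for readability, and the function name `accepts` is made up for illustration.

```python
import re

# Rules in the order they appear in the log output: (sign, regex).
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|jpg|png|css|js|pdf)$"),  # shortened suffix list, illustration only
    ("+", r"^http://women.net/*"),
    ("-", r"."),
]

def accepts(url, rules=RULES):
    """Return True if the first rule whose regex is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False

if __name__ == "__main__":
    for url in (
        "http://women.net/index.html",
        "http://women.net/logo.png",
        "http://example.com/",
        "http://women.network.example/",
    ):
        print(url, accepts(url))
```

If that reading is right, the trailing `/*` in my rule means "zero or more slashes" rather than "anything after the slash", and the unescaped `.` matches any character, so the pattern is looser than I intended (the last URL in the sketch is accepted too).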

Please let me know what else I need to do in order to index the data.

Thanks in advance



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Malformed-URL-skipping-java-net-MalformedURLException-tp3590159p4005547.html
Sent from the Nutch - User mailing list archive at Nabble.com.
