Hello, I'm getting the following exception while indexing my site, which is hosted on my local machine.

Skipping -^(file|ftp|mailto)::java.net.MalformedURLException: no protocol: -^(file|ftp|mailto):
Skipping -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|css|CSS|wmv|WMV)$:java.net.MalformedURLException: no protocol: -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|css|CSS|wmv|WMV)$
Skipping +^http://women.net/*:java.net.MalformedURLException: no protocol: +^http://women.net/*
Skipping -.:java.net.MalformedURLException: no protocol: -.
Skipping -^(file|ftp|mailto)::java.net.MalformedURLException: no protocol: -^(file|ftp|mailto):
Skipping -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|swf|SWF|ashx|css|CSS|wmv|WMV)$:java.net.MalformedURLException: no protocol: -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|swf|SWF|ashx|css|CSS|wmv|WMV)$
Skipping +^http://women.net/*:java.net.MalformedURLException: no protocol: +^http://women.net/*
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Nutch/local/crawl/segments/20120905010233/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)

I added the following lines to conf\regex-urlfilter.txt:

#+^http://([a-z0-9]*\.)*women.com/
+http://women.net/*

and to conf\crawl-urlfilter.txt:

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+http://women.net/*

Please let me know what else I need to do in order to index the data.

Thanks in advance

--
View this message in context: http://lucene.472066.n3.nabble.com/Malformed-URL-skipping-java-net-MalformedURLException-tp3590159p4005547.html
Sent from the Nutch - User mailing list archive at Nabble.com.

