Re: Malformed URL: '', skipping (java.net.MalformedURLException

Lewis John Mcgibbney Wed, 05 Sep 2012 07:08:50 -0700

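I think you've passed your regex URL filter rules as the seed URL list
at the inject step. Every "Skipping" entry in your log is a filter rule
(a line starting with + or -), not a URL, which is why each one fails
with "no protocol".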
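
To illustrate why those lines are rejected, here is a minimal sketch that
feeds a few of the strings from your log to java.net.URL, the class named
in the exception. The class name SeedLineCheck and the method parseError
are made up for this example and are not part of Nutch; I am also only
assuming that the injector's parsing behaves like this constructor.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not Nutch code: shows how java.net.URL reacts to
// filter rules versus a real seed URL.
public class SeedLineCheck {

    /** Returns null if the line parses as a URL, otherwise the exception message. */
    public static String parseError(String line) {
        try {
            // Note: this constructor is deprecated in recent JDKs but still works.
            new URL(line);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        String[] lines = {
            "+^http://women.net/*",   // filter rule taken from the log
            "-^(file|ftp|mailto):",   // filter rule taken from the log
            "http://women.net/"       // what a seed list entry should look like
        };
        for (String line : lines) {
            String err = parseError(line);
            System.out.println(line + " -> " + (err == null ? "OK" : err));
        }
    }
}
```

The two filter rules should be reported with a "no protocol" message, as
in your log, while the plain http:// URL parses. So the seed directory you
inject should contain only files with plain URLs, one per line, and the
+/- rules belong in the urlfilter files under conf.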


As a side note, it is always VERY helpful to provide basic information
such as the Nutch version and the steps you took to reproduce the error.

hth

Lewis

On Wed, Sep 5, 2012 at 10:16 AM, gaurav.gupta
<[email protected]> wrote:
> Hello
>
> I'm getting the following exception while indexing my site, which is hosted
> on my local machine.
>
>
> Skipping -^(file|ftp|mailto)::java.net.MalformedURLException: no protocol:
> -^(file|ftp|mailto):
> Skipping
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|css|CSS|wmv|WMV)$:java.net.MalformedURLException:
> no protocol:
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|css|CSS|wmv|WMV)$
> Skipping +^http://women.net/*:java.net.MalformedURLException: no protocol:
> +^http://women.net/*
> Skipping -.:java.net.MalformedURLException: no protocol: -.
> Skipping -^(file|ftp|mailto)::java.net.MalformedURLException: no protocol:
> -^(file|ftp|mailto):
> Skipping
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|swf|SWF|ashx|css|CSS|wmv|WMV)$:java.net.MalformedURLException:
> no protocol:
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|swf|SWF|ashx|css|CSS|wmv|WMV)$
>
> Skipping +^http://women.net/*:java.net.MalformedURLException: no protocol:
> +^http://women.net/*
>
> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist: file:/C:/Nutch/local/crawl/segments/20120905010233/parse_data
>         at
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>         at
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>         at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>         at
> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>         at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>         at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>
>
> I added the following lines to conf\regex-urlfilter.txt
>
> #+^http://([a-z0-9]*\.)*women.com/
> +http://women.net/*
>
> and in conf\crawl-urlfilter.txt
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> +http://women.net/*
>
> Please let me know what else I need to do in order to index the data.
>
> Thanks in advance
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Malformed-URL-skipping-java-net-MalformedURLException-tp3590159p4005547.html
> Sent from the Nutch - User mailing list archive at Nabble.com.



-- 
Lewis
