Using the regex URL filter plugin you can, for example, pass only http:// URLs:
+http://

On Friday 16 December 2011 16:09:00 mina wrote:
> thanks for your answer, how i set up proper URL filters?
>
> On Fri, Dec 16, 2011 at 3:42 AM, Markus Jelsma-2 [via Lucene] <[email protected]> wrote:
> > You haven't set up proper URL filters. You'd typically have URL filters that
> > only pass the protocol's you need.
> >
> > On Thursday 15 December 2011 23:48:50 mina wrote:
> > > i crawl sites with nutch 1.3. i see this exception in my log when nutch
> > > crawl my sites:
> > >
> > > Malformed URL: '', skipping (java.net.MalformedURLException: no protocol:
> > >   at java.net.URL.<init>(URL.java:567)
> > >   at java.net.URL.<init>(URL.java:464)
> > >   at java.net.URL.<init>(URL.java:413)
> > >   at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:247)
> > >   at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:109)
> > >   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
> > >   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
> > >   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> > >   )
> > >
> > > --
> > > View this message in context:
> > > http://lucene.472066.n3.nabble.com/Malformed-URL-skipping-java-net-MalformedURLException-tp3590159p3590159.html
> > > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> > --
> > Markus Jelsma - CTO - Openindex

--
Markus Jelsma - CTO - Openindex
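
To make the rule above more concrete, here is a rough sketch of how a '+'/'-' rule list of this kind behaves. It is not Nutch code: it is a small Python model, assuming the usual convention of conf/regex-urlfilter.txt that the first matching rule decides and that a URL matching no rule is dropped. The names RULES, parse_rules and accepts are made up for illustration, and the rule is anchored as +^http:// here (my variation on the +http:// rule above) so that it only matches at the start of the URL.

```python
import re

# Hypothetical rule list written in the style of Nutch's regex-urlfilter.txt.
# '+' means accept, '-' means reject; lines starting with '#' are comments.
RULES = """
# accept http:// URLs only
+^http://
# reject everything else
-.
"""


def parse_rules(text):
    """Turn the rule text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': %r" % line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return True if the first rule that matches the URL is an accept rule."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is dropped


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in ["http://example.com/", "", "ftp://example.com/file"]:
        print(repr(url), accepts(url, rules))
```

With rules like these, the empty URL from the log ('') matches neither rule and is dropped, so it should never reach the generator in the first place.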

