Using the regex URL filter plugin you can, for example, pass only http:// URLs:

+http://
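
A minimal sketch of how that rule could sit in the filter file. It assumes the stock regex filter reading conf/regex-urlfilter.txt and that urlfilter-regex is listed in plugin.includes; both are assumptions about your setup, so please check them. Rules are tried top to bottom, the first match decides, and a URL that matches no rule is rejected. The ^ anchor is my addition, so the rule only matches at the start of the URL.

```
# conf/regex-urlfilter.txt (sketch, assumes the default file location)
# '+' accepts and '-' rejects; the rest of the line is a Java regex.
# Rules are tried in order and the first one that matches decides.

# accept only URLs that start with http://
+^http://

# no catch-all '+.' rule at the end, so anything else (ftp:, mailto:,
# file:) matches nothing and is filtered out
```

If you also crawl https sites, the corresponding rule would be +^https?:// instead.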

On Friday 16 December 2011 16:09:00 mina wrote:
> Thanks for your answer. How do I set up proper URL filters?
> 
> On Fri, Dec 16, 2011 at 3:42 AM, Markus Jelsma-2 [via Lucene] <[email protected]> wrote:
> > You haven't set up proper URL filters. You'd typically have URL filters
> > that only pass the protocols you need.
> > 
> > On Thursday 15 December 2011 23:48:50 mina wrote:
> > > I crawl sites with Nutch 1.3. I see this exception in my log when Nutch
> > > crawls my sites:
> > >
> > >     Malformed URL: '', skipping (java.net.MalformedURLException: no protocol:
> > > at java.net.URL.<init>(URL.java:567)
> > > at java.net.URL.<init>(URL.java:464)
> > > at java.net.URL.<init>(URL.java:413)
> > > at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:247)
> > > at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:109)
> > > at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
> > > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
> > > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> > > )
> > 
> > --
> > Markus Jelsma - CTO - Openindex

-- 
Markus Jelsma - CTO - Openindex
