Re: Suffix URLFilter not working

2013-06-13 Thread Peter Gaines

My apologies, I sent it in error.
I have resent the email as a new thread.
BTW, I specified the full protocol, i.e. http://

Regards,
Peter

-Original Message- 
From: Sebastian Nagel

Sent: Wednesday, June 12, 2013 8:54 PM
To: user@nutch.apache.org
Subject: Re: Suffix URLFilter not working

Hi Peter,

please do not hijack threads.

Seed URLs must be fully specified including protocol, e.g.:
http://nutch.apache.org/
but not
apache.org

Sebastian

On 06/12/2013 05:08 PM, Peter Gaines wrote:
I have installed version 2.2 of Nutch on a CentOS machine and am using
the following command:

./bin/crawl urls testcrawl solrfolder 2

I have attempted to use the default filter configuration and also
explicitly specified urlfilter-regex in the nutch-default.xml (without
modifying the default regex filters).

However, it fails each time and I can see the exception below in the
hadoop.log.

As you can see, it looks like it has not picked up anything from the
seed.txt in the urls folder (as the MalformedURLException error usually
prints the url). This file has 1 entry with the protocol specified,
e.g. http://www.google.com

Can anyone shed any light on this?

Regards,
Peter.

2013-06-12 17:00:47,857 INFO  crawl.InjectorJob - InjectorJob: starting at 2013-06-12 17:00:47
2013-06-12 17:00:47,858 INFO  crawl.InjectorJob - InjectorJob: Injecting urlDir: urls
2013-06-12 17:00:48,140 INFO  crawl.InjectorJob - InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
2013-06-12 17:00:48,158 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-06-12 17:00:48,206 WARN  snappy.LoadSnappy - Snappy native library not loaded
2013-06-12 17:00:48,344 INFO  mapreduce.GoraRecordWriter - gora.buffer.write.limit = 1
2013-06-12 17:00:48,403 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2013-06-12 17:00:48,407 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2013-06-12 17:00:48,407 WARN  mapred.LocalJobRunner - job_local_0001
java.net.MalformedURLException: no protocol:
   at java.net.URL.<init>(URL.java:585)
   at java.net.URL.<init>(URL.java:482)
   at java.net.URL.<init>(URL.java:431)
   at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:44)
   at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:162)
   at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:88)
   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
2013-06-12 17:00:49,300 ERROR crawl.InjectorJob - InjectorJob: java.lang.RuntimeException: job failed: name=[testcrawl]inject urls, jobid=job_local_0001
   at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
   at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
   at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
   at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
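
The empty URL in "java.net.MalformedURLException: no protocol:" suggests that something other than the single seed URL (for example a blank line or a stray file in the urls directory) reached the injector. A quick way to inspect the seed directory, offered here only as a hypothetical diagnostic and not taken from the thread (the directory and file names are the ones mentioned above):

   ls -la urls/            # make sure only the intended seed file is present
   cat -A urls/seed.txt    # reveal blank lines, hidden characters and CR line endings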






Re: Suffix URLFilter not working

2013-06-12 Thread Bai Shen
Sorry.  I forgot to mention that I'm running a 2.x release taken from a few
weeks ago.


On Wed, Jun 12, 2013 at 8:31 AM, Bai Shen baishen.li...@gmail.com wrote:

 I'm dealing with a lot of file types that I don't want to index.  I was
 originally using the regex filter to exclude them but it was getting out of
 hand.

 I changed my plugin includes from

 urlfilter-regex

 to

 urlfilter-(regex|suffix)

 I've tried both using the default urlfilter-suffix.txt file (adding the
 extensions I don't want) and making my own file that starts with + and
 lists the extensions I do want.

 Neither of these approaches seems to work.  I continue to get urls added to
 the database which contain extensions I don't want.  Even adding a
 urlfilter.order section to my nutch-site.xml doesn't work.

 I don't see any obvious bugs in the code, so I'm a bit stumped.  Any
 suggestions for what else to look at?

 Thanks.
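
For reference, the plugin.includes change and the urlfilter.order property described above would look roughly like the following in nutch-site.xml. This is only a sketch: the full plugin.includes value is much longer and site-specific (only the URL-filter fragment is shown), and the fully qualified filter class names are assumptions, not quoted from the thread.

   <property>
     <name>plugin.includes</name>
     <!-- only the URL-filter fragment is shown; keep the rest of the default value -->
     <value>...|urlfilter-(regex|suffix)|...</value>
   </property>
   <property>
     <name>urlfilter.order</name>
     <!-- assumed class names; filters run in the listed order -->
     <value>org.apache.nutch.urlfilter.suffix.SuffixURLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
   </property>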



RE: Suffix URLFilter not working

2013-06-12 Thread Markus Jelsma
We happily use that filter just as it is shipped with Nutch. Just enabling it 
in plugin.includes works for us. To ease testing you can use the bin/nutch 
org.apache.nutch.net.URLFilterChecker to test filters.
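
A sketch of how the checker can be driven, assuming it behaves as in contemporary Nutch releases: it takes -allCombined (or -filterName with a fully qualified filter class name; both the option spelling and the class name are assumptions here), reads URLs on standard input, and prints each URL prefixed with + if accepted or - if rejected:

   echo "http://www.example.com/archive.zip" | \
     bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

   echo "http://www.example.com/archive.zip" | \
     bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.suffix.SuffixURLFilter

Because the checker reads from standard input, invoking it with a URL as a command-line argument would simply block waiting for input, which may be what happens in the follow-up message below.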
 
 
-Original message-
 From:Bai Shen baishen.li...@gmail.com
 Sent: Wed 12-Jun-2013 14:32
 To: user@nutch.apache.org
 Subject: Suffix URLFilter not working
 
 I'm dealing with a lot of file types that I don't want to index.  I was
 originally using the regex filter to exclude them but it was getting out of
 hand.
 
 I changed my plugin includes from
 
 urlfilter-regex
 
 to
 
 urlfilter-(regex|suffix)
 
 I've tried both using the default urlfilter-suffix.txt file (adding the
 extensions I don't want) and making my own file that starts with + and
 lists the extensions I do want.

 Neither of these approaches seems to work.  I continue to get urls added to
 the database which contain extensions I don't want.  Even adding a
 urlfilter.order section to my nutch-site.xml doesn't work.
 
 I don't see any obvious bugs in the code, so I'm a bit stumped.  Any
 suggestions for what else to look at?
 
 Thanks.
 


Re: Suffix URLFilter not working

2013-06-12 Thread Bai Shen
I figured as much, which is why I'm not sure why it's not working for me.

I ran bin/nutch org.apache.nutch.net.URLFilterChecker http://myserver/myurl
and it's been thirty minutes with no results.

Is there something I should run before running that?

Thanks.


On Wed, Jun 12, 2013 at 8:34 AM, Markus Jelsma
markus.jel...@openindex.io wrote:

 We happily use that filter just as it is shipped with Nutch. Just enabling
 it in plugin.includes works for us. To ease testing you can use the
 bin/nutch org.apache.nutch.net.URLFilterChecker to test filters.


 -Original message-
  From:Bai Shen baishen.li...@gmail.com
  Sent: Wed 12-Jun-2013 14:32
  To: user@nutch.apache.org
  Subject: Suffix URLFilter not working
 
  I'm dealing with a lot of file types that I don't want to index.  I was
  originally using the regex filter to exclude them but it was getting out
 of
  hand.
 
  I changed my plugin includes from
 
  urlfilter-regex
 
  to
 
  urlfilter-(regex|suffix)
 
  I've tried both using the default urlfilter-suffix.txt file (adding the
  extensions I don't want) and making my own file that starts with + and
  lists the extensions I do want.
 
  Neither of these approaches seems to work.  I continue to get urls added to
  the database which contain extensions I don't want.  Even adding a
  urlfilter.order section to my nutch-site.xml doesn't work.
 
  I don't see any obvious bugs in the code, so I'm a bit stumped.  Any
  suggestions for what else to look at?
 
  Thanks.
 



Re: Suffix URLFilter not working

2013-06-12 Thread Bai Shen
Doh!  I really should just read the code of things before posting.

I ran the URLFilterChecker and passed it a url that the SuffixFilter
should flag, and it still passed it.  However, if I change the url to end in
a format that is in the default config file, it rejects the url.

So it looks like the problem is that it's not loading the altered config
file from my conf directory.  Not sure why, since the regex filter correctly
finds its config file.
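
Based on the description in the original message, the whitelist-style file would look roughly like the sketch below (the file name and contents are illustrative; the leading + line switches the filter to accept only the listed suffixes, whereas the default file shipped with Nutch lists suffixes to reject):

   # conf/urlfilter-suffix.txt (illustrative whitelist mode)
   +
   .html
   .htm
   .pdf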


On Wed, Jun 12, 2013 at 8:34 AM, Markus Jelsma
markus.jel...@openindex.io wrote:

 We happily use that filter just as it is shipped with Nutch. Just enabling
 it in plugin.includes works for us. To ease testing you can use the
 bin/nutch org.apache.nutch.net.URLFilterChecker to test filters.


 -Original message-
  From:Bai Shen baishen.li...@gmail.com
  Sent: Wed 12-Jun-2013 14:32
  To: user@nutch.apache.org
  Subject: Suffix URLFilter not working
 
  I'm dealing with a lot of file types that I don't want to index.  I was
  originally using the regex filter to exclude them but it was getting out
 of
  hand.
 
  I changed my plugin includes from
 
  urlfilter-regex
 
  to
 
  urlfilter-(regex|suffix)
 
  I've tried both using the default urlfilter-suffix.txt file (adding the
  extensions I don't want) and making my own file that starts with + and
  lists the extensions I do want.
 
  Neither of these approaches seems to work.  I continue to get urls added to
  the database which contain extensions I don't want.  Even adding a
  urlfilter.order section to my nutch-site.xml doesn't work.
 
  I don't see any obvious bugs in the code, so I'm a bit stumped.  Any
  suggestions for what else to look at?
 
  Thanks.
 



Re: Suffix URLFilter not working

2013-06-12 Thread Bai Shen
Turns out it was because I had a copy of the default file sitting in the
directory I was calling nutch from.

Once I removed that, it correctly found my copy in the conf directory.


On Wed, Jun 12, 2013 at 9:29 AM, Bai Shen baishen.li...@gmail.com wrote:

 Doh!  I really should just read the code of things before posting.

 I ran the URLFilterChecker and passed it a url that the SuffixFilter
 should flag, and it still passed it.  However, if I change the url to end in
 a format that is in the default config file, it rejects the url.

 So it looks like the problem is that it's not loading the altered config
 file from my conf directory.  Not sure why, since the regex filter correctly
 finds its config file.


 On Wed, Jun 12, 2013 at 8:34 AM, Markus Jelsma markus.jel...@openindex.io
  wrote:

 We happily use that filter just as it is shipped with Nutch. Just
 enabling it in plugin.includes works for us. To ease testing you can use
 the bin/nutch org.apache.nutch.net.URLFilterChecker to test filters.


 -Original message-
  From:Bai Shen baishen.li...@gmail.com
  Sent: Wed 12-Jun-2013 14:32
  To: user@nutch.apache.org
  Subject: Suffix URLFilter not working
 
  I'm dealing with a lot of file types that I don't want to index.  I was
  originally using the regex filter to exclude them but it was getting
 out of
  hand.
 
  I changed my plugin includes from
 
  urlfilter-regex
 
  to
 
  urlfilter-(regex|suffix)
 
  I've tried both using the default urlfilter-suffix.txt file (adding the
  extensions I don't want) and making my own file that starts with + and
  lists the extensions I do want.
 
  Neither of these approaches seems to work.  I continue to get urls added to
  the database which contain extensions I don't want.  Even adding a
  urlfilter.order section to my nutch-site.xml doesn't work.
 
  I don't see any obvious bugs in the code, so I'm a bit stumped.  Any
  suggestions for what else to look at?
 
  Thanks.
 





Re: Suffix URLFilter not working

2013-06-12 Thread Peter Gaines
I have installed version 2.2 of Nutch on a CentOS machine and am using the
following command:


./bin/crawl urls testcrawl solrfolder 2

I have attempted to use the default filter configuration and also explicitly
specified urlfilter-regex in the nutch-default.xml (without modifying the
default regex filters).

However, it fails each time and I can see the exception below in the
hadoop.log.


As you can see, it looks like it has not picked up anything from the seed.txt
in the urls folder (as the MalformedURLException error usually prints the url).
This file has 1 entry with the protocol specified, e.g. http://www.google.com

Can anyone shed any light on this?

Regards,
Peter.

2013-06-12 17:00:47,857 INFO  crawl.InjectorJob - InjectorJob: starting at 2013-06-12 17:00:47
2013-06-12 17:00:47,858 INFO  crawl.InjectorJob - InjectorJob: Injecting urlDir: urls
2013-06-12 17:00:48,140 INFO  crawl.InjectorJob - InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
2013-06-12 17:00:48,158 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-06-12 17:00:48,206 WARN  snappy.LoadSnappy - Snappy native library not loaded
2013-06-12 17:00:48,344 INFO  mapreduce.GoraRecordWriter - gora.buffer.write.limit = 1
2013-06-12 17:00:48,403 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2013-06-12 17:00:48,407 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2013-06-12 17:00:48,407 WARN  mapred.LocalJobRunner - job_local_0001
java.net.MalformedURLException: no protocol:
   at java.net.URL.<init>(URL.java:585)
   at java.net.URL.<init>(URL.java:482)
   at java.net.URL.<init>(URL.java:431)
   at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:44)
   at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:162)
   at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:88)
   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
2013-06-12 17:00:49,300 ERROR crawl.InjectorJob - InjectorJob: java.lang.RuntimeException: job failed: name=[testcrawl]inject urls, jobid=job_local_0001
   at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
   at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
   at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
   at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)




Re: Suffix URLFilter not working

2013-06-12 Thread Sebastian Nagel
Hi Peter,

please do not hijack threads.

Seed URLs must be fully specified including protocol, e.g.:
 http://nutch.apache.org/
but not
 apache.org

Sebastian
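
For illustration, a seed file along these lines (e.g. urls/seed.txt, the directory and file names used earlier in the thread) contains one fully specified URL per line:

   http://nutch.apache.org/
   http://www.example.com/some/page.html

and is then injected (or picked up by bin/crawl) with something like:

   bin/nutch inject urls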

On 06/12/2013 05:08 PM, Peter Gaines wrote:
 I have installed version 2.2 of Nutch on a CentOS machine and am using the 
 following command:
 
 ./bin/crawl urls testcrawl solrfolder 2
 
 I have attempted to use the default filter configuration and also explicitly 
 specified urlfilter-regex
 in the nutch-default.xml (without modifying the default regex filters).
 
 However, it fails each time and I can see the exception below in the 
 hadoop.log.
 
 As you can see, it looks like it has not picked up anything from the seed.txt
 in the urls folder (as the MalformedURLException error usually prints the url).
 This file has 1 entry with the protocol specified, e.g. http://www.google.com
 
 Can anyone shed any light on this?
 
 Regards,
 Peter.
 
 2013-06-12 17:00:47,857 INFO  crawl.InjectorJob - InjectorJob: starting at 2013-06-12 17:00:47
 2013-06-12 17:00:47,858 INFO  crawl.InjectorJob - InjectorJob: Injecting urlDir: urls
 2013-06-12 17:00:48,140 INFO  crawl.InjectorJob - InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
 2013-06-12 17:00:48,158 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 2013-06-12 17:00:48,206 WARN  snappy.LoadSnappy - Snappy native library not loaded
 2013-06-12 17:00:48,344 INFO  mapreduce.GoraRecordWriter - gora.buffer.write.limit = 1
 2013-06-12 17:00:48,403 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
 2013-06-12 17:00:48,407 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
 2013-06-12 17:00:48,407 WARN  mapred.LocalJobRunner - job_local_0001
 java.net.MalformedURLException: no protocol:
        at java.net.URL.<init>(URL.java:585)
        at java.net.URL.<init>(URL.java:482)
        at java.net.URL.<init>(URL.java:431)
        at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:44)
        at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:162)
        at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:88)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
 2013-06-12 17:00:49,300 ERROR crawl.InjectorJob - InjectorJob: java.lang.RuntimeException: job failed: name=[testcrawl]inject urls, jobid=job_local_0001
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)