Re: Suffix URLFilter not working
My apologies, I sent it in error. I have resent the email as a new thread. BTW, I specified the full protocol, i.e. http://

Regards,
Peter

-----Original Message-----
From: Sebastian Nagel
Sent: Wednesday, June 12, 2013 8:54 PM
To: user@nutch.apache.org
Subject: Re: Suffix URLFilter not working

Hi Peter,

please do not hijack threads.

Seed URLs must be fully specified including the protocol, e.g.
  http://nutch.apache.org/
but not
  apache.org

Sebastian

On 06/12/2013 05:08 PM, Peter Gaines wrote:
> I have installed version 2.2 of Nutch on a CentOS machine and am using the
> following command:
>   ./bin/crawl urls testcrawl solrfolder 2
> I have attempted to use the default filter configuration and also explicitly
> specified urlfilter-regex in nutch-default.xml (without modifying the default
> regex filters). However, it fails each time and I can see the exception below
> in hadoop.log. As you can see, it looks like it has not picked up anything
> from the seed.txt in the urls folder (the MalformedURLException message
> usually prints the offending URL). This file has one entry with the protocol
> specified, e.g. http://www.google.com
> Can anyone shed any light on this?
> Regards, Peter.
>
> 2013-06-12 17:00:47,857 INFO crawl.InjectorJob - InjectorJob: starting at 2013-06-12 17:00:47
> 2013-06-12 17:00:47,858 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: urls
> 2013-06-12 17:00:48,140 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
> 2013-06-12 17:00:48,158 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2013-06-12 17:00:48,206 WARN snappy.LoadSnappy - Snappy native library not loaded
> 2013-06-12 17:00:48,344 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 1
> 2013-06-12 17:00:48,403 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> 2013-06-12 17:00:48,407 WARN mapred.FileOutputCommitter - Output path is null in cleanup
> 2013-06-12 17:00:48,407 WARN mapred.LocalJobRunner - job_local_0001
> java.net.MalformedURLException: no protocol:
>     at java.net.URL.<init>(URL.java:585)
>     at java.net.URL.<init>(URL.java:482)
>     at java.net.URL.<init>(URL.java:431)
>     at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:44)
>     at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:162)
>     at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:88)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> 2013-06-12 17:00:49,300 ERROR crawl.InjectorJob - InjectorJob: java.lang.RuntimeException: job failed: name=[testcrawl]inject urls, jobid=job_local_0001
>     at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
>     at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
>     at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
>     at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
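Sebastian's advice boils down to a mechanical check on the seed list: every line must carry an explicit protocol. A small sketch (the directory and file names match the crawl command used in this thread; nothing here comes from Nutch itself):

```shell
# Create a seed list where every entry is a fully specified URL.
mkdir -p urls
cat > urls/seed.txt <<'EOF'
http://nutch.apache.org/
http://www.google.com/
EOF

# Sanity check: count the lines that start with http:// or https://.
# A protocol-less entry like "apache.org" would not match, and is what later
# produces "java.net.MalformedURLException: no protocol" in the inject step.
grep -Ec '^https?://' urls/seed.txt   # prints 2 for the file above
```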
Re: Suffix URLFilter not working
Sorry. I forgot to mention that I'm running a 2.x release taken from a few weeks ago.

On Wed, Jun 12, 2013 at 8:31 AM, Bai Shen baishen.li...@gmail.com wrote:
> I'm dealing with a lot of file types that I don't want to index. I was
> originally using the regex filter to exclude them, but it was getting out of
> hand, so I changed my plugin includes from urlfilter-regex to
> urlfilter-(regex|suffix).
>
> I've tried both the default urlfilter-suffix.txt file, adding the extensions
> I don't want to it, and making my own file that starts with + and includes
> the extensions I do want. Neither approach seems to work: I continue to get
> URLs added to the database which contain extensions I don't want. Even adding
> a urlfilter.order section to my nutch-site.xml doesn't help.
>
> I don't see any obvious bugs in the code, so I'm a bit stumped. Any
> suggestions for what else to look at?
>
> Thanks.
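For reference, the plugin switch described above lives in nutch-site.xml. A sketch of the relevant property (abbreviated: a real plugin.includes value also lists protocol, parse, index, and scoring plugins, elided here as "..."):

```xml
<!-- nutch-site.xml (sketch): enable both the regex and suffix URL filters. -->
<property>
  <name>plugin.includes</name>
  <value>...|urlfilter-(regex|suffix)|...</value>
</property>
```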
RE: Suffix URLFilter not working
We happily use that filter just as it is shipped with Nutch. Just enabling it in plugin.includes works for us. To ease testing you can use bin/nutch org.apache.nutch.net.URLFilterChecker to test filters.

-----Original Message-----
From: Bai Shen baishen.li...@gmail.com
Sent: Wed 12-Jun-2013 14:32
To: user@nutch.apache.org
Subject: Suffix URLFilter not working

> I'm dealing with a lot of file types that I don't want to index. I was
> originally using the regex filter to exclude them, but it was getting out of
> hand, so I changed my plugin includes from urlfilter-regex to
> urlfilter-(regex|suffix).
>
> I've tried both the default urlfilter-suffix.txt file, adding the extensions
> I don't want to it, and making my own file that starts with + and includes
> the extensions I do want. Neither approach seems to work: I continue to get
> URLs added to the database which contain extensions I don't want. Even adding
> a urlfilter.order section to my nutch-site.xml doesn't help.
>
> I don't see any obvious bugs in the code, so I'm a bit stumped. Any
> suggestions for what else to look at?
>
> Thanks.
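As a rough illustration of the accept/reject contract such a checker works with: URLs go in one per line on standard input, and each comes back prefixed with + (accepted) or - (rejected). Nutch itself is not exercised here; the following is a toy shell stand-in with a made-up suffix list, not the real filter chain:

```shell
# Toy stand-in for a suffix URL filter check (illustrative only; the real
# URLFilterChecker loads whatever filter plugins are configured in Nutch).
# Reject URLs ending in one of the listed suffixes, accept everything else.
suffixes='\.exe$|\.zip$|\.gif$'
check_urls() {
  while read -r url; do
    if printf '%s\n' "$url" | grep -Eq "$suffixes"; then
      echo "-$url"   # rejected by the suffix list
    else
      echo "+$url"   # accepted
    fi
  done
}

printf '%s\n' 'http://example.com/index.html' 'http://example.com/download.zip' | check_urls
# prints:
# +http://example.com/index.html
# -http://example.com/download.zip
```

Note that a checker which reads from stdin can appear to hang if it is invoked without piping any input into it.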
Re: Suffix URLFilter not working
I figured as much, which is why I'm not sure why it's not working for me.

I ran bin/nutch org.apache.nutch.net.URLFilterChecker http://myserver/myurl and it's been thirty minutes with no results. Is there something I should run before running that?

Thanks.

On Wed, Jun 12, 2013 at 8:34 AM, Markus Jelsma markus.jel...@openindex.io wrote:
> We happily use that filter just as it is shipped with Nutch. Just enabling
> it in plugin.includes works for us. To ease testing you can use
> bin/nutch org.apache.nutch.net.URLFilterChecker to test filters.
>
> -----Original Message-----
> From: Bai Shen baishen.li...@gmail.com
> Sent: Wed 12-Jun-2013 14:32
> To: user@nutch.apache.org
> Subject: Suffix URLFilter not working
>
> > I'm dealing with a lot of file types that I don't want to index. I was
> > originally using the regex filter to exclude them, but it was getting out
> > of hand, so I changed my plugin includes from urlfilter-regex to
> > urlfilter-(regex|suffix).
> > I've tried both the default urlfilter-suffix.txt file, adding the
> > extensions I don't want to it, and making my own file that starts with +
> > and includes the extensions I do want. Neither approach seems to work: I
> > continue to get URLs added to the database which contain extensions I
> > don't want. Even adding a urlfilter.order section to my nutch-site.xml
> > doesn't help.
> > I don't see any obvious bugs in the code, so I'm a bit stumped. Any
> > suggestions for what else to look at?
> > Thanks.
Re: Suffix URLFilter not working
Doh! I really should just read the code of things before posting.

I ran the URLFilterChecker and passed it a URL that the suffix filter should flag, and it still passed. However, if I change the URL to end in a suffix that is in the default config file, it rejects the URL. So it looks like the problem is that it's not loading the altered config file from my conf directory. Not sure why, since the regex filter correctly finds its config file.

On Wed, Jun 12, 2013 at 8:34 AM, Markus Jelsma markus.jel...@openindex.io wrote:
> We happily use that filter just as it is shipped with Nutch. Just enabling
> it in plugin.includes works for us. To ease testing you can use
> bin/nutch org.apache.nutch.net.URLFilterChecker to test filters.
>
> -----Original Message-----
> From: Bai Shen baishen.li...@gmail.com
> Sent: Wed 12-Jun-2013 14:32
> To: user@nutch.apache.org
> Subject: Suffix URLFilter not working
>
> > I'm dealing with a lot of file types that I don't want to index. I was
> > originally using the regex filter to exclude them, but it was getting out
> > of hand, so I changed my plugin includes from urlfilter-regex to
> > urlfilter-(regex|suffix).
> > I've tried both the default urlfilter-suffix.txt file, adding the
> > extensions I don't want to it, and making my own file that starts with +
> > and includes the extensions I do want. Neither approach seems to work: I
> > continue to get URLs added to the database which contain extensions I
> > don't want. Even adding a urlfilter.order section to my nutch-site.xml
> > doesn't help.
> > I don't see any obvious bugs in the code, so I'm a bit stumped. Any
> > suggestions for what else to look at?
> > Thanks.
Re: Suffix URLFilter not working
Turns out it was because I had a copy of the default file sitting in the directory I was calling Nutch from. Once I removed that, it correctly found my copy in the conf directory.

On Wed, Jun 12, 2013 at 9:29 AM, Bai Shen baishen.li...@gmail.com wrote:
> Doh! I really should just read the code of things before posting.
>
> I ran the URLFilterChecker and passed it a URL that the suffix filter should
> flag, and it still passed. However, if I change the URL to end in a suffix
> that is in the default config file, it rejects the URL. So it looks like the
> problem is that it's not loading the altered config file from my conf
> directory. Not sure why, since the regex filter correctly finds its config
> file.
>
> On Wed, Jun 12, 2013 at 8:34 AM, Markus Jelsma markus.jel...@openindex.io wrote:
> > We happily use that filter just as it is shipped with Nutch. Just enabling
> > it in plugin.includes works for us. To ease testing you can use
> > bin/nutch org.apache.nutch.net.URLFilterChecker to test filters.
> >
> > -----Original Message-----
> > From: Bai Shen baishen.li...@gmail.com
> > Sent: Wed 12-Jun-2013 14:32
> > To: user@nutch.apache.org
> > Subject: Suffix URLFilter not working
> >
> > > I'm dealing with a lot of file types that I don't want to index. I was
> > > originally using the regex filter to exclude them, but it was getting
> > > out of hand, so I changed my plugin includes from urlfilter-regex to
> > > urlfilter-(regex|suffix).
> > > I've tried both the default urlfilter-suffix.txt file, adding the
> > > extensions I don't want to it, and making my own file that starts with +
> > > and includes the extensions I do want. Neither approach seems to work: I
> > > continue to get URLs added to the database which contain extensions I
> > > don't want. Even adding a urlfilter.order section to my nutch-site.xml
> > > doesn't help.
> > > I don't see any obvious bugs in the code, so I'm a bit stumped. Any
> > > suggestions for what else to look at?
> > > Thanks.
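The root cause above is easy to reproduce in miniature: if a file lookup tries the working directory before conf/, a stale local copy of urlfilter-suffix.txt wins and the customized copy never loads. A sketch with a hypothetical loader (the real resolution happens via the Java classpath, which this shell loop only imitates):

```shell
# Reproduce the shadowing problem with a toy two-location lookup.
mkdir -p demo/conf && cd demo
echo 'stale default rules' > urlfilter-suffix.txt       # leftover local copy
echo 'customized rules'    > conf/urlfilter-suffix.txt  # the intended copy

find_rules() {                  # stand-in for a classpath-style lookup
  for d in . conf; do
    if [ -f "$d/urlfilter-suffix.txt" ]; then
      cat "$d/urlfilter-suffix.txt"
      return
    fi
  done
}

find_rules                      # prints "stale default rules"
rm urlfilter-suffix.txt         # remove the stray copy...
find_rules                      # ...now prints "customized rules"
```

So when a filter mysteriously keeps using default rules, checking for stray copies of its config file in the launch directory (and elsewhere on the classpath) is a cheap first step.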
Re: Suffix URLFilter not working
I have installed version 2.2 of Nutch on a CentOS machine and am using the following command:

  ./bin/crawl urls testcrawl solrfolder 2

I have attempted to use the default filter configuration and also explicitly specified urlfilter-regex in nutch-default.xml (without modifying the default regex filters). However, it fails each time and I can see the exception below in hadoop.log. As you can see, it looks like it has not picked up anything from the seed.txt in the urls folder (the MalformedURLException message usually prints the offending URL). This file has one entry with the protocol specified, e.g. http://www.google.com

Can anyone shed any light on this?

Regards,
Peter.

2013-06-12 17:00:47,857 INFO crawl.InjectorJob - InjectorJob: starting at 2013-06-12 17:00:47
2013-06-12 17:00:47,858 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: urls
2013-06-12 17:00:48,140 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
2013-06-12 17:00:48,158 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-06-12 17:00:48,206 WARN snappy.LoadSnappy - Snappy native library not loaded
2013-06-12 17:00:48,344 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 1
2013-06-12 17:00:48,403 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2013-06-12 17:00:48,407 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-06-12 17:00:48,407 WARN mapred.LocalJobRunner - job_local_0001
java.net.MalformedURLException: no protocol:
    at java.net.URL.<init>(URL.java:585)
    at java.net.URL.<init>(URL.java:482)
    at java.net.URL.<init>(URL.java:431)
    at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:44)
    at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:162)
    at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:88)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
2013-06-12 17:00:49,300 ERROR crawl.InjectorJob - InjectorJob: java.lang.RuntimeException: job failed: name=[testcrawl]inject urls, jobid=job_local_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
    at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
Re: Suffix URLFilter not working
Hi Peter,

please do not hijack threads.

Seed URLs must be fully specified including the protocol, e.g.
  http://nutch.apache.org/
but not
  apache.org

Sebastian

On 06/12/2013 05:08 PM, Peter Gaines wrote:
> I have installed version 2.2 of Nutch on a CentOS machine and am using the
> following command:
>   ./bin/crawl urls testcrawl solrfolder 2
> I have attempted to use the default filter configuration and also explicitly
> specified urlfilter-regex in nutch-default.xml (without modifying the default
> regex filters). However, it fails each time and I can see the exception below
> in hadoop.log. As you can see, it looks like it has not picked up anything
> from the seed.txt in the urls folder (the MalformedURLException message
> usually prints the offending URL). This file has one entry with the protocol
> specified, e.g. http://www.google.com
> Can anyone shed any light on this?
> Regards, Peter.
>
> 2013-06-12 17:00:47,857 INFO crawl.InjectorJob - InjectorJob: starting at 2013-06-12 17:00:47
> 2013-06-12 17:00:47,858 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: urls
> 2013-06-12 17:00:48,140 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
> 2013-06-12 17:00:48,158 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2013-06-12 17:00:48,206 WARN snappy.LoadSnappy - Snappy native library not loaded
> 2013-06-12 17:00:48,344 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 1
> 2013-06-12 17:00:48,403 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> 2013-06-12 17:00:48,407 WARN mapred.FileOutputCommitter - Output path is null in cleanup
> 2013-06-12 17:00:48,407 WARN mapred.LocalJobRunner - job_local_0001
> java.net.MalformedURLException: no protocol:
>     at java.net.URL.<init>(URL.java:585)
>     at java.net.URL.<init>(URL.java:482)
>     at java.net.URL.<init>(URL.java:431)
>     at org.apache.nutch.util.TableUtil.reverseUrl(TableUtil.java:44)
>     at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:162)
>     at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:88)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> 2013-06-12 17:00:49,300 ERROR crawl.InjectorJob - InjectorJob: java.lang.RuntimeException: job failed: name=[testcrawl]inject urls, jobid=job_local_0001
>     at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
>     at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
>     at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
>     at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)