Re: Nutch 2.0

2010-06-24 Thread Julien Nioche
So things like NUTCH-666 should be ported to SOLR to use on Nutch 2.0? ok. problem is some things like NUTCH-666 are required both on indexing and definitely on the SOLR side from now on btw, when Nutch 2.0 is planned to be released? 1.1 took several years... we are still at a very early

Re: Question on normalizing urls / RegexURLNormalizer

2010-06-24 Thread Hannes Carl Meyer
Nope, that changes nothing. Just checked out my log file: 2010-06-24 17:13:40,410 INFO plugin.PluginRepository - Plugins: looking in: /~/apache-nutch-1.1-bin/plugins 2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true] 2010-06-24 17:13:41,439 INFO

Re: Question on normalizing urls / RegexURLNormalizer

2010-06-24 Thread Hannes Carl Meyer
Just tried it in nutch-1.0 with the same kind of behavior: hc.me...@server01:~/nutch-1.0 ./bin/nutch plugin urlnormalizer-regex org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer http://www.myinputurl.com Plugin 'urlnormalizer-regex' not present or inactive. (it is present and it is

Re: Question on normalizing urls / RegexURLNormalizer

2010-06-24 Thread reinhard schwab
hi hannes, i have identified your problem. your nutch-site.xml plugin.includes property contains a newline after urlnormalizer-(basic|pass|regex), which breaks pattern matching in PluginRepository.java. <property> <name>plugin.includes</name>
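
A minimal sketch of why a stray newline in the property value can make a plugin look "not present or inactive". It assumes the plugin.includes value is compiled as a regular expression and matched against the whole plugin id; it is an illustration in Python, not the code in PluginRepository.java. The property value and the helper names (is_included, strip_whitespace) are hypothetical.

```python
import re

# Hypothetical plugin.includes value with a stray newline after the
# urlnormalizer group, as described in the message above.
broken_includes = "protocol-http|urlnormalizer-(basic|pass|regex)\n|scoring-opic"


def is_included(includes: str, plugin_id: str) -> bool:
    """Return True if the whole plugin id matches the includes regex."""
    return re.fullmatch(includes, plugin_id) is not None


def strip_whitespace(includes: str) -> str:
    """Remove all whitespace, including newlines, from the property value."""
    return re.sub(r"\s+", "", includes)


fixed_includes = strip_whitespace(broken_includes)

# The newline becomes part of the urlnormalizer alternative, so the plain
# plugin id no longer matches until the whitespace is removed.
print(is_included(broken_includes, "urlnormalizer-regex"))  # False
print(is_included(fixed_includes, "urlnormalizer-regex"))   # True
```

In practice the fix is simply to keep the plugin.includes value on one line in nutch-site.xml.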

Re: Staying in Domain

2010-06-24 Thread Max Lynch
On Wed, Jun 23, 2010 at 5:27 PM, Dennis Kubes ku...@apache.org wrote: You may still see some urls that *seem* to be outside of your domains list while using the domain urlfilter. Remember the following: 1. Urls are checked in order of domain suffix, domain name, and hostname. If you

Re: Question on normalizing urls / RegexURLNormalizer

2010-06-24 Thread Hannes Carl Meyer
Awesome... Thank you very very much :-) On Thu, Jun 24, 2010 at 6:55 PM, reinhard schwab reinhard.sch...@aon.at wrote: hi hannes, i have identified your problem. your nutch-site.xml plugin.includes property contains a newline after urlnormalizer-(basic|pass|regex), which breaks pattern

Indexing only PDFs

2010-06-24 Thread Max Lynch
Hi, I would like to crawl a list of pages but only index PDFs. From what I gather I can add an exclusion for all non .pdf extensions in crawl-urlfilter.txt. However, I would also like to apply an additional restriction, that I only index pages that match a certain query. In my head, this

Re: Parsing PostScript files

2010-06-24 Thread Andrzej Bialecki
On 2010-06-24 10:56, arkadi.kosmy...@csiro.au wrote: Hi, It looks like Tika does not include a PostScript parser. At least the copy that comes with Nutch 1.1. Is this right? I just want to double check because PostScript is a major file format. I get errors Can't retrieve Tika parser for

Re: Hadoop Level Distributed Cache

2010-06-24 Thread Dennis Kubes
Please do, I am very interested in how a solution like that works for you in terms of performance. Dennis On 06/24/2010 12:10 PM, Emmanuel de Castro Santana wrote: Thank you for all the help Dennis, all of this is valid information to me! I am trying a solution using Memcache, will post