Re: Re: Funky duplicate url's, getting much worse!

2010-09-29 Thread Julien Nioche
Hi guys, IIRC the OutlinkExtractor is the same in parse-tika and parse-html. Could you please open a JIRA and attach a patch for the TestOutlinkExtractor so that we can reproduce the problem? Thanks, Julien -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/

Re: What is nutch doing?

2010-09-29 Thread Andrzej Bialecki
On 2010-09-29 01:13, Steve Cohen wrote:
fc08b7f0 * *java/util/regex/Matcher.search(I)Z [compiled] +174 (line 2208)
fc12c078 * *java/util/regex/Matcher.find()Z [compiled] +132 (line 1058)
fc12c078 * *org/apache/nutch/urlfilter/regex/RegexURLFilter$Rule.match(Ljava/lang/String;)Z+18 (line

Re: Funky duplicate url's, getting much worse!

2010-09-29 Thread Julien Nioche
I don't know how to run a single test, but if you run ant test you should be able to find the logs for each individual class in ./build/test, with a separate log, TEST-org.apache.nutch.parse.TestOutlinkExtractor.txt, which will be easier than going through a single huge file. J. On 29 September

Re: Funky duplicate url's, getting much worse!

2010-09-29 Thread Julien Nioche
What I did for similarpages.com was to write a custom URL filter that detected repetition of path elements and discarded a URL if it had a path element occurring more than N times. I don't know what regex AJ suggested, but the approach above was generic and also quite fast. We also had other things like
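The path-element idea described above might be sketched roughly as follows. This is not Julien's actual filter: the class name RepeatedPathElementFilter, the maxRepeats parameter, and the example URLs are all assumptions made for illustration. The sketch only mirrors the Nutch URL filter convention of returning the URL to keep it and null to discard it, and it uses nothing outside the JDK.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; a real Nutch plugin would implement
// the URLFilter extension point instead of being a plain class.
class RepeatedPathElementFilter {
    private final int maxRepeats; // the "N" from the message (assumed name)

    RepeatedPathElementFilter(int maxRepeats) {
        this.maxRepeats = maxRepeats;
    }

    /** Returns the URL unchanged, or null if any path element occurs more than maxRepeats times. */
    String filter(String url) {
        String path = url;

        // Drop the query string and fragment so only the path is counted.
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == '?' || c == '#') {
                path = path.substring(0, i);
                break;
            }
        }

        // Drop the scheme and host, if present.
        int scheme = path.indexOf("://");
        if (scheme >= 0) {
            int firstSlash = path.indexOf('/', scheme + 3);
            path = firstSlash >= 0 ? path.substring(firstSlash) : "";
        }

        // Count each path element and reject as soon as one exceeds the limit.
        Map<String, Integer> counts = new HashMap<>();
        for (String element : path.split("/")) {
            if (element.isEmpty()) {
                continue;
            }
            int seen = counts.merge(element, 1, Integer::sum);
            if (seen > maxRepeats) {
                return null;
            }
        }
        return url;
    }
}
```

With maxRepeats set to 2, a URL such as http://example.com/a/b/a/c/a/d would be discarded because the element a occurs three times, while http://example.com/a/b/a/c would be kept. A single pass over the path with a hash map is one plausible reason such a filter could be fast compared with a backtracking regex.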

Re: Funky duplicate url's, getting much worse!

2010-09-29 Thread Markus Jelsma
The following regex, -.*(/[^/]+)/[^/]+\1/[^/]+\1/, prevents URLs such as
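To see what the rule above does, here is a small stand-alone check using java.util.regex. The leading - is taken to be the exclude marker of the regex URL filter configuration rather than part of the pattern, so it is left out here. The class name and the example URLs are made up for illustration, and find() is used because the stack trace earlier in this listing shows RegexURLFilter calling Matcher.find().

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern itself comes from the message.
class RepeatedSegmentRegexDemo {
    // A path segment (group 1) that reappears twice more, each time
    // separated by exactly one other segment, and followed by a slash.
    static final Pattern REPEATED =
            Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    /** True if the exclusion rule would match, i.e. the URL would be dropped. */
    static boolean isExcluded(String url) {
        return REPEATED.matcher(url).find();
    }
}
```

Under these assumptions, isExcluded("http://example.com/a/b/a/c/a/") is true, while isExcluded("http://example.com/a/b/c/d/e/") is false. Note that the pattern ends in a slash, so the same looping URL without a trailing slash would not match.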

GenericOptionsParser

2010-09-29 Thread Steve Cohen
I've been seeing the following error in the hadoop log when I kick off the nutch crawl script. 2010-09-29 10:47:07,837 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. Now, I've looked up the error and see several

Error with Hadoop when moving from Local to HDFS Pseudo-Distributed Mode...

2010-09-29 Thread brad
I have tried to move from a local instance of Nutch to a Pseudo-Distributed Mode Hadoop Nutch on a single machine. I set everything up using the How to Setup Nutch (V1.1) and Hadoop instructions located here: http://wiki.apache.org/nutch/NutchHadoopTutorial Then I moved all my relevant files to

Re: Error with Hadoop when moving from Local to HDFS Pseudo-Distributed Mode...

2010-09-29 Thread Steve Cohen
Did you start up the hadoop daemon? On Wed, Sep 29, 2010 at 3:08 PM, brad b...@bcs-mail.net wrote: I have tried to move from a local instance of Nutch to a Pseudo-Distributed Mode Hadoop Nutch on a single machine. I set everything up using the How to Setup Nutch (V1.1) and Hadoop

Re: Error with Hadoop when moving from Local to HDFS Pseudo-Distributed Mode...

2010-09-29 Thread Andrzej Bialecki
On 2010-09-29 21:08, brad wrote: I have tried to move from a local instance of Nutch to a Pseudo-Distributed Mode Hadoop Nutch on a single machine. I set everything up using the How to Setup Nutch (V1.1) and Hadoop instructions located here: http://wiki.apache.org/nutch/NutchHadoopTutorial

RE: Error with Hadoop when moving from Local to HDFS Pseudo-Distributed Mode...

2010-09-29 Thread brad
Thanks Andrzej. It did not occur to me that the path would need to change in my scripts. As for root, is it a risk if I am just using the box for testing? -Original Message- From: Andrzej Bialecki [mailto:a...@getopt.org] Sent: Wednesday, September 29, 2010 12:31 PM To:

How to Index Pure Text into Seperate Fields?

2010-09-29 Thread Savannah Beckett
Hi, I am using XPath to index different parts of the HTML pages into different fields. Now, I have some pure text documents that have no HTML, so I can't use XPath. How do I index these pure text documents into different fields of the index? How do I make nutch/solr understand these different parts

Re: Error with Hadoop when moving from Local to HDFS Pseudo-Distributed Mode...

2010-09-29 Thread Andrzej Bialecki
On 2010-09-29 21:50, brad wrote: Thanks Andrzej. It did not occur to me that the path would need to change in my scripts. As for root, is it a risk if I am just using the box for testing? No, but IMHO it's a bad habit. Later on you will want to move this to a production env. and then a few

RE: Error with Hadoop when moving from Local to HDFS Pseudo-Distributed Mode...

2010-09-29 Thread brad
Thanks. I'll change it when I reconfigure the box. -Original Message- From: Andrzej Bialecki [mailto:a...@getopt.org] Sent: Wednesday, September 29, 2010 2:01 PM To: user@nutch.apache.org Subject: Re: Error with Hadoop when moving from Local to HDFS Pseudo-Distributed Mode... On

Excluding javascript files from indexing and search results.

2010-09-29 Thread Mark Stephenson
Hi, I'm wondering if there's a way to prevent nutch from indexing javascript files. I still would like to fetch and parse javascript files to find valuable outlinks, but I don't want them to show up in my search results. Is there a good way to do this? Thanks a lot, Mark

RE: Excluding javascript files from indexing and search results.

2010-09-29 Thread Arkadi.Kosmynin
Hi Mark, I am not sure, maybe there is a simpler way, but if you want something to be fetched and processed but not indexed, you can write an index filter plugin and return null for documents that you don't want in the index. This is relatively easy to do, just use the index-basic filter as
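The decision logic for such a filter might look roughly like the sketch below. It deliberately does not use the Nutch plugin API: the class name, the Map standing in for the index document, and the rule for recognising JavaScript (a content type containing "javascript" or a path ending in .js) are all assumptions for illustration. Only the convention of returning null to keep a document out of the index comes from the message above.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an indexing filter; a real plugin would
// implement Nutch's indexing filter extension point instead.
class JavascriptIndexSkipSketch {

    /** Guesses whether a fetched resource is JavaScript (assumed heuristic). */
    static boolean isJavascript(String url, String contentType) {
        if (contentType != null
                && contentType.toLowerCase(Locale.ROOT).contains("javascript")) {
            return true;
        }
        // Ignore any query string or fragment before checking the suffix.
        String path = url;
        int cut = path.indexOf('?');
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        cut = path.indexOf('#');
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        return path.toLowerCase(Locale.ROOT).endsWith(".js");
    }

    /** Returns the document to index it, or null to leave it out of the index. */
    static Map<String, String> filter(Map<String, String> doc,
                                      String url, String contentType) {
        return isJavascript(url, contentType) ? null : doc;
    }
}
```

Because the check runs only at indexing time, the JavaScript files would still be fetched and parsed for outlinks, which is the behaviour Mark asked for.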