Hi guys,
IIRC the OutlinkExtractor is the same in parse-tika and parse-html. Could
you please open a JIRA and attach a patch for the TestOutlinkExtractor so
that we can reproduce the problem?
Thanks
Julien
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
On 2010-09-29 01:13, Steve Cohen wrote:
fc08b7f0 * java/util/regex/Matcher.search(I)Z [compiled] +174 (line 2208)
fc12c078 * java/util/regex/Matcher.find()Z [compiled] +132 (line 1058)
fc12c078 * org/apache/nutch/urlfilter/regex/RegexURLFilter$Rule.match(Ljava/lang/String;)Z+18 (line
Don't know how to run a single test, but if you do *ant test* you should be
able to find the logs for each individual class in ./build/test, with a
separate log named *TEST-org.apache.nutch.parse.TestOutlinkExtractor.txt*.
That will be easier than going through a single huge file.
J.
On 29 September
What I did for similarpages.com was to write a custom URL filter that
detected repetition of path elements and discarded a URL if it had a path
element occurring more than N times. I don't know what regex AJ suggested but the
approach above was generic and also quite fast.
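
A minimal sketch of that kind of check, in Python rather than as a real Nutch plugin. The function names, the default threshold and the example.com URLs are all made up for illustration; the original filter's code is not shown in this thread.

```python
from collections import Counter
from urllib.parse import urlsplit


def has_repeated_path_element(url, max_repeats=3):
    """Return True if any single path element occurs more than max_repeats times.

    max_repeats plays the role of "N" in the description; 3 is an arbitrary default.
    """
    path = urlsplit(url).path
    # Split the path on "/" and drop empty pieces (leading/trailing/double slashes).
    elements = [e for e in path.split("/") if e]
    counts = Counter(elements)
    return any(n > max_repeats for n in counts.values())


def filter_url(url, max_repeats=3):
    """Hypothetical filter: return the URL to keep it, or None to discard it."""
    return None if has_repeated_path_element(url, max_repeats) else url
```

With the default threshold, filter_url("http://example.com/a/b/a/b/a/b/a/b") discards the URL because "a" occurs four times, while a URL with no repeated elements is returned unchanged. Counting elements is a single pass over the path, which fits the "generic and also quite fast" description.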
We also had other things like
The following regex
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
prevents URLs such as
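
To see what the regex above catches, here is a small Python sketch. It assumes the leading "-" is the exclusion marker of the filter file rather than part of the pattern, and it uses an unanchored search. The sample URLs are invented for illustration and are not the ones elided from the message.

```python
import re

# The pattern from the message without the leading "-" (assumed to be the
# "exclude" marker of the filter rule, not part of the regex itself).
# It looks for a path element that shows up three times, each time
# separated from the next by exactly one other element.
LOOP_PATTERN = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")


def is_excluded(url):
    """Return True if the URL contains the repeating-element pattern anywhere."""
    return LOOP_PATTERN.search(url) is not None


if __name__ == "__main__":
    for sample in (
        "http://example.com/a/b/a/c/a/d",  # "/a" recurs in every second slot
        "http://example.com/a/b/c/d/",     # no repeated element
    ):
        print(sample, "->", "excluded" if is_excluded(sample) else "kept")
```

Note that the pattern requires a "/" after the third occurrence, so a URL that ends right on the repeated element (no trailing slash and nothing after it) is not matched.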
I've been seeing the following error in the hadoop log when I kick off the
nutch crawl script.
2010-09-29 10:47:07,837 WARN mapred.JobClient - Use GenericOptionsParser
for parsing the arguments. Applications should implement Tool for the same.
Now, I've looked up the error and see several
I have tried to move from a local instance of Nutch to a Pseudo-Distributed
Mode Hadoop Nutch on a single machine. I set everything up using the How to
Setup Nutch (V1.1) and Hadoop instructions located here:
http://wiki.apache.org/nutch/NutchHadoopTutorial
Then I moved all my relevant files to
Did you start up the hadoop daemon?
On Wed, Sep 29, 2010 at 3:08 PM, brad b...@bcs-mail.net wrote:
I have tried to move from a local instance of Nutch to a Pseudo-Distributed
Mode Hadoop Nutch on a single machine. I set everything up using the How to
Setup Nutch (V1.1) and Hadoop
On 2010-09-29 21:08, brad wrote:
I have tried to move from a local instance of Nutch to a Pseudo-Distributed
Mode Hadoop Nutch on a single machine. I set everything up using the How to
Setup Nutch (V1.1) and Hadoop instructions located here:
http://wiki.apache.org/nutch/NutchHadoopTutorial
Thanks Andrzej. It did not occur to me that the path would need to change
in my scripts.
As for root, is it a risk if I am just using the box for testing?
-Original Message-
From: Andrzej Bialecki [mailto:a...@getopt.org]
Sent: Wednesday, September 29, 2010 12:31 PM
To:
Hi,
I am using XPath to index different parts of the HTML pages into different
fields. Now, I have some pure text documents that have no HTML, so I can't use
XPath. How do I index this pure text into different fields of the index? How
do I make nutch/solr understand these different parts
On 2010-09-29 21:50, brad wrote:
Thanks Andrzej. It did not occur to me that the path would need to change
in my scripts.
As for root, is it a risk if I am just using the box for testing?
No, but IMHO it's a bad habit. Later on you will want to move this to a
production env. and then a few
Thanks. I'll change it when I reconfigure the box.
-Original Message-
From: Andrzej Bialecki [mailto:a...@getopt.org]
Sent: Wednesday, September 29, 2010 2:01 PM
To: user@nutch.apache.org
Subject: Re: Error with Hadoop when moving from Local to HDFS
Pseudo-Distributed Mode...
On
Hi,
I'm wondering if there's a way to prevent nutch from indexing
javascript files. I still would like to fetch and parse javascript
files to find valuable outlinks, but I don't want them to show up in
my search results. Is there a good way to do this?
Thanks a lot,
Mark
Hi Mark,
I am not sure, maybe there is a simpler way, but if you want something to be
fetched and processed but not indexed, you can write an index filter plugin and
return null for documents that you don't want in the index. This is relatively
easy to do, just use the index-basic filter as