That's wonderful - what a great list! You guys respond very quickly...
Now I gotta get back to reading the docs as I'm sure most of what I just
asked is already in there...;)
Best!
Paul
-Original Message-
From: John Mendenhall [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 29, 2008
I'm experiencing some trouble in forming simple queries that include
non-alphabetic characters. One specific instance is if I want to search
for the string @test.
If I build up the query using either addRequiredPhrase, addRequiredTerm,
or Query.parse, the search term loses the @ sign at the
Yes. When you index your pages, the text is run through an analyzer
that parses it into tokens. The analyzer does interesting stuff like
lowercasing, throwing away bothersome characters, and stemming
(reducing the word "looking" to "look", because "look" is the stem
of the verb). There are many
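To make the behavior concrete, here is a minimal sketch (this is NOT Nutch's actual analyzer; the class and method names are made up for illustration) of how a typical analyzer's lowercasing and non-letter stripping makes the '@' in '@test' disappear:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: mimics the two steps described above --
// lowercase the input, then split on non-alphanumeric characters,
// so a token like "@test" comes out as just "test".
public class AnalyzerSketch {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Anything that is not a letter or digit acts as a separator;
        // the '@' sign is thrown away here.
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [search, for, test, now] -- note the '@' is gone
        System.out.println(tokenize("Search for @test NOW"));
    }
}
```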
hi,
I finally got the crawler running without exceptions by building from trunk,
but I found it cannot crawl anything... I then dumped the crawl db
and saw this in the metadata:
_pst_:robots_denied(18)
any idea?
Hi All,
I get the same exception when trying the nightly build on a static
page. Can anyone help?
Vicious wrote:
Hi All,
Using the latest nightly build I am trying to run a crawl. I have set the
agent property and all relevant plugins. However, as soon as I run the crawl
I get
Hi,
Here is some additional information: before the exception appears, Nutch
prints two messages:
fetching http://cnn.com
org.apache.tika.mime.MimeUtils load
INFO loading [mime-types.xml]
fetch of http://www.cnn.com/ failed with: java.lang.NullPointerException
Fetcher: done
Seems mime-type has
Hi,
I finally figured out the solution:
go to conf/,
rename the old mime-types.xml to anything else,
then copy tika-mimetypes.xml into the same directory under the name
mime-types.xml.
The crawler should work now.
In short, this is because 1.0-dev uses Tika, but the old mime-detection
config file is
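The steps above as shell commands, demonstrated in a scratch directory with stand-in files (so it runs anywhere; in a real checkout you would do the mv/cp directly inside conf/, and the backup name is arbitrary):

```shell
# Scratch directory with stand-ins for the two conf/ files
demo=$(mktemp -d)
cd "$demo"
echo '<old mime config/>'  > mime-types.xml
echo '<tika mime config/>' > tika-mimetypes.xml

# The actual fix: keep the old file under another name,
# then install Tika's file under the name Nutch expects.
mv mime-types.xml mime-types.xml.old
cp tika-mimetypes.xml mime-types.xml

cat mime-types.xml   # now contains Tika's definitions
```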
hi,
I found the answer: this is generated because robots.txt disallowed
crawling of the current URL.
Hope it can help.
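Concretely, a robots.txt like this on the target host (a hypothetical example) is enough to produce the robots_denied status, because Nutch honors the disallow rules:

```
User-agent: *
Disallow: /
```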
Vinci wrote:
hi,
I finally got the crawler running without exceptions by building from trunk,
but I found it cannot crawl anything... I then dumped the crawl
hi,
I have some trouble with a site that does content redirection: Nutch can't
crawl the site itself but can crawl its RSS feed. Unfortunately, the links in
the RSS redirect back to the site -- that is the bad part -- but I found that
the link I want appears inside the redirect URL as a GET parameter:
Hi,
I already added the plugin name to nutch-default.xml, but it still throws the
exception ParseException: parser not found for
contentType=application/atom+xml, while RSS feeds work fine after I added
parse-rss.
I checked that the plugin supports Atom feeds with mime-type
application/atom+xml. Did I miss
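For context: besides plugin.includes, Nutch also consults the contentType-to-plugin mapping in conf/parse-plugins.xml. A mapping along these lines (a sketch only; check the actual file in your tree) is what routes a mime-type to parse-rss:

```xml
<!-- conf/parse-plugins.xml: route Atom content to the RSS parser -->
<mimeType name="application/atom+xml">
  <plugin id="parse-rss" />
</mimeType>
```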
I tried to run fetch without parsing by setting the
fetcher.parse property to false. When I ran parse,
it said the segment had already been parsed by the
fetch process.
It appears NUTCH-337 only fixed the unused
fetcher.parse configuration value in the Fetcher.java
class. I have tried
Does the latest Nutch work with JDK 1.5 or 1.6, and Tomcat 5.5 or 6.0?
Thanks!
Nick
I'm using Nutch 0.9 (the latest stable release) with JDK 1.5, and Tomcat
6.0. I had a problem with JDK 1.6.
CB
-Original Message-
From: Duan, Nick [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 30, 2008 4:50 PM
To: nutch-user@lucene.apache.org
Subject: JDK 1.5 Tomcat 5.5
Does
crawl-urlfilter.txt and regex-urlfilter.txt are used to block or allow
certain URLs. They do not let you extract one URL from another. You
might want to use conf/regex-normalize.xml for that.
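A regex-normalize.xml rule for that kind of rewrite might look like this (the host and the "url" GET parameter here are made-up examples; adapt the pattern to the real redirect links):

```xml
<!-- conf/regex-normalize.xml: replace a redirect link with the value
     of its (hypothetical) "url" GET parameter -->
<regex>
  <pattern>^http://example\.com/redirect\?url=(.*)$</pattern>
  <substitution>$1</substitution>
</regex>
```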
Regards,
Susam Pal
On Jan 31, 2008 1:43 AM, Vinci [EMAIL PROTECTED] wrote:
hi,
I
Hi list.
I'm having trouble figuring out why certain pages are being ranked
much higher than others on my Nutch installation.
For example, not long ago, the department of computing's homepage was
ranked #1 for the query computing department.
However, recently it has dropped in the rankings
Do you have the Java heap space options set in the 'mapred.child.java.opts'
property (in conf/hadoop-site.xml)? For a machine with 1gb ram and 1gb swap
space, I set this to '-Xms1024m -Xmx2048m'.
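As a conf/hadoop-site.xml fragment, that setting would look like this (values copied from above; tune them to your machine's memory):

```xml
<!-- conf/hadoop-site.xml: heap options for map/reduce child JVMs -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xms1024m -Xmx2048m</value>
</property>
```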
Best,
Siddhartha
On Jan 31, 2008 3:23 AM, John Mendenhall [EMAIL PROTECTED] wrote:
The one task
Although only one of the machines will be used for the fetch task (because
all your urls are from a single host), the other tasks do not have any such
requirements and can run on multiple machines. So running in distributed
mode might still benefit you.
To 'turn off' the 3 slaves, you can simply