You can have a look at NutchAnalysis.jj and create some customized rules
for your own keywords.
Cheers,
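For illustration, adding a keyword to a JavaCC grammar such as NutchAnalysis.jj comes down to declaring a token and using it in a production; the fragment below is a hypothetical sketch, not the actual contents of NutchAnalysis.jj:

```
// Hypothetical JavaCC fragment; the real token and production names
// in NutchAnalysis.jj will differ.
TOKEN : {
  <OR: "OR">
| <NOT: "NOT">
}
```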
On 9/29/06, Stefan Neufeind [EMAIL PROTECTED] wrote:
Hi,
I'm trying to build a search like
searchword AND (site:www.example.com OR site:www.foobar.org)
But no such syntax that I tried seemed to work.
this in the 0.8 release? Since it IS open source.
;)
Just a thought,
Alex
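For what it's worth, the semantics being asked for (one required term plus a group of site clauses of which at least one must match) can be sketched in plain Java, independently of Nutch's query classes; every name below is illustrative:

```java
import java.util.List;
import java.util.Set;

// Illustrative sketch of "required term AND (siteA OR siteB)" matching.
// None of these names come from Nutch's actual query classes.
public class SiteOrQuerySketch {

    // A document matches when it contains the required term AND its
    // site field equals at least one of the optional site clauses.
    static boolean matches(Set<String> docTerms, String docSite,
                           String requiredTerm, List<String> allowedSites) {
        if (!docTerms.contains(requiredTerm)) {
            return false;               // required clause failed
        }
        return allowedSites.contains(docSite);  // optional group: at least one must match
    }

    public static void main(String[] args) {
        Set<String> terms = Set.of("searchword", "other");
        List<String> sites = List.of("www.example.com", "www.foobar.org");
        System.out.println(matches(terms, "www.example.com", "searchword", sites));   // true
        System.out.println(matches(terms, "www.elsewhere.net", "searchword", sites)); // false
    }
}
```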
-Original Message-
From: Nguyen Ngoc Giang [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 15 March 2006 3:45 PM
To: nutch-user@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: Boolean OR QueryFilter
Hi David,
I
approach. Note also that you will probably need
to change BasicQueryFilter and perhaps other filters to work correctly
with optional terms.
Nguyen Ngoc Giang wrote:
Sorry, I'm a newbie in open source, and I'm not familiar with the process of submitting patches :D
I'll try to put my solution here first
Hi David,
I also did a similar task. In fact, I hacked into the jj code to add definitions for OR and NOT. If you need any help, don't hesitate to contact me :).
Regards,
Giang
PS: I also believe that a hack to the jj code is necessary.
On 3/8/06, David Odmark [EMAIL PROTECTED] wrote:
Hi
Hi everyone,
I constantly encounter this problem when Nutch comes to the database-closing stage. The crawler causes my system to hang, and it needs to be restarted. Can anyone help me figure this out? Here is my log file before the hang:
060210 161454 Finishing update
060210 161959 Processing
Hi folks,
I'm struggling with the Nutch crawler at the closing-database step. I'm running on Red Hat Enterprise, 4 GB RAM, JDK 1.5.06. The database size is around a few million pages. The system usually crashes when Nutch comes to closing the database (during both database update and fetchlist generation). The
I think Nutch uses the PageRank algorithm, but of course the algorithm Google is using is much more complicated than what is described in their paper.
You can probably find the code in org.apache.nutch.tools.LinkAnalysisTool and org.apache.nutch.tools.DistributedAnalysisTools.
Regards,
blocking ports?
Do you use NDFS or local?
Are you on an NTFS or FAT32 file system?
How large is the dataset you are working with? Have you tried splitting the work into smaller jobs instead of one large job?
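A sketch of what segment-at-a-time crawling can look like with 0.8-style commands; the paths and the -topN value are examples only:

```shell
# Illustrative only: generate a bounded fetchlist, fetch it,
# and fold the results back into the crawldb, one segment at a time.
bin/nutch generate crawl/crawldb crawl/segments -topN 50000
s=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $s
bin/nutch updatedb crawl/crawldb $s
```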
--- Nguyen Ngoc Giang [EMAIL PROTECTED] wrote:
Hi all,
I'd like to bring back this topic
Hi folks,
When I try crawling, there are many Read Timeout errors. It seems that this error is not handled as gracefully as http.max.delays. I would like to catch this error in the same manner as http.max.delays, that is, to retry the page that produced it. Can anyone suggest a way? Any can
-default/site.xml
I'm not sure, but I think these kinds of failed URLs are also retried another time (db.fetch.retry.max)
HTH
Stefan
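The two properties mentioned are set in conf/nutch-site.xml; a minimal sketch, with example values only:

```xml
<!-- Illustrative nutch-site.xml overrides; the values are examples, not recommendations. -->
<property>
  <name>http.max.delays</name>
  <value>100</value>
</property>
<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
</property>
```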
Am 21.12.2005 um 09:00 schrieb Nguyen Ngoc Giang:
Hi folks,
When I try crawling, there are many Read Timeout errors. It seems that this error
, but I would like to confirm that. I do see a variable that defines the next fetch interval, but I am not sure of it. If anyone has more information in this regard, please let me know.
Thank you in advance,
On 12/19/05, Nguyen Ngoc Giang [EMAIL PROTECTED] wrote:
As I understand, by default
It seems that you are running Tomcat from the wrong place. Make sure that you put the db and segments directories in the place where you run Tomcat.
For example, if you put your Nutch files under ROOT and db and segments are also there, then you should move to ROOT and start Tomcat as
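A minimal sketch of the layout being described, with illustrative paths:

```shell
# Illustrative paths only. Start Tomcat from the directory that
# holds the crawl output, so the webapp resolves db/ and segments/
# via relative paths.
cd /path/to/ROOT           # contains db/ and segments/
$CATALINA_HOME/bin/startup.sh
```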
Hi everyone,
I'm writing a small program which just utilizes Nutch as a crawler only, with no search functionality. The program should be able to return page content given a URL as input. I would like to ask how we can get the page content given only the URL, since the webdb only provides a
Groschupf [EMAIL PROTECTED] wrote:
Take a look at the cached page; it returns the content from the segment.
Am 09.12.2005 um 09:24 schrieb Nguyen Ngoc Giang:
Hi everyone,
I'm writing a small program which just utilizes Nutch as a crawler only, with no search
Hi everyone,
I'm writing a JSP program to allow crawling via the web. My JSP script follows nutch.tools.CrawlTool, which tries to create the database, inject the database, fetch, and index.
I have difficulty identifying the plugins. Creating the database is fine, because it doesn't require any plugin.
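Fetching and indexing do depend on which plugins are enabled, which is controlled by the plugin.includes property; a minimal nutch-site.xml sketch, with an example value only:

```xml
<!-- Example only: the value must list every plugin your fetch/index run needs. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```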