Re: Searching with and and or?

2006-10-08 Thread Nguyen Ngoc Giang
You can have a look at NutchAnalysis.jj and create some customized rules for your own keywords. Cheers, On 9/29/06, Stefan Neufeind [EMAIL PROTECTED] wrote: Hi, I'm trying to build a search like searchword AND (site:www.example.com OR site:www.foobar.org) But no such syntax I tried

Re: Boolean OR QueryFilter

2006-03-15 Thread Nguyen Ngoc Giang
this in the 0.8 release? Since it IS open source. ;) Just a thought, Alex -----Original Message----- From: Nguyen Ngoc Giang [mailto:[EMAIL PROTECTED] Sent: Wednesday, 15 March 2006 3:45 PM To: nutch-user@lucene.apache.org; [EMAIL PROTECTED] Subject: Re: Boolean OR QueryFilter Hi David, I

Re: Boolean OR QueryFilter

2006-03-15 Thread Nguyen Ngoc Giang
approach. Note also that you will probably need to change BasicQueryFilter and perhaps other filters to work correctly with optional terms. Nguyen Ngoc Giang wrote: Sorry, I'm a newbie in OS, and I'm not familiar with the way of updating patches :D I'll try to put my solution here first

Re: Boolean OR QueryFilter

2006-03-14 Thread Nguyen Ngoc Giang
Hi David, I also did a similar task. In fact, I hacked into jj code to add the definition for OR and NOT. If you need any help, don't hesitate to contact me :). Regards, Giang PS: I also believe that a hack to jj code is necessary. On 3/8/06, David Odmark [EMAIL PROTECTED] wrote: Hi

Bug in closing the database?

2006-02-10 Thread Nguyen Ngoc Giang
Hi everyone, I constantly encounter this problem when Nutch reaches the database closing stage. The crawler causes my system to hang, and it needs to be restarted. Can anyone help me figure this out? Here is my log file before hanging: 060210 161454 Finishing update 060210 161959 Processing

Closing database causes system crash

2006-01-15 Thread Nguyen Ngoc Giang
Hi folks, I'm struggling with the Nutch crawler at the database closing step. I'm running on Redhat Enterprise, 4 GB RAM, JDK 1.5.06. The database size is around a few million pages. I usually get a system crash when Nutch reaches the database closing step (during both database update and fetchlist). The

Re: About ranking in Nutch

2006-01-08 Thread Nguyen Ngoc Giang
I think Nutch uses the PageRank algorithm, but of course, the algorithm that Google is using is much more complicated than what is described in their paper. You can probably find the code in org.apache.nutch.tools.LinkAnalysisTool and org.apache.nutch.tools.DistributedAnalysisTools. Regards,
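
For readers unfamiliar with the algorithm mentioned in this reply, here is a minimal sketch of textbook PageRank by power iteration. It is not the code in Nutch's link analysis tools and makes no claim about how they work; the function name `pagerank`, the damping value of 0.85, the iteration count, and the three-page graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration (illustrative, not Nutch's code).

    links maps each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {
            page: (1.0 - damping) / n + damping * dangling / n
            for page in pages
        }
        # Each page passes an equal share of its damped rank to its targets.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph, for illustration only.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(toy_graph)
```

In this toy graph, page `c` is linked from both `a` and `b`, so it ends up with a higher score than `b`, which is linked only from `a`.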

Re: java.io.IOException: already exists

2006-01-04 Thread Nguyen Ngoc Giang
blocking ports? Do you use NDFS or local? Are you on an NTFS or FAT32 file system? How large is the dataset you are working with? Have you split it into several smaller jobs instead of large jobs? --- Nguyen Ngoc Giang [EMAIL PROTECTED] wrote: Hi all, I'd like to bring back this topic

Read Time out problem

2005-12-21 Thread Nguyen Ngoc Giang
Hi folks, When I try crawling, there are many Read Timeout errors. It seems that this error is not handled as properly as http.max.delays. I would like to handle this error in the same manner as http.max.delays, that is, to retry the page with this error. Can anyone suggest a way? Any can
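
The retry behaviour asked for in this message can be sketched generically as below. This is only an illustration of retrying on a read timeout, not Nutch's fetcher, and it does not read http.max.delays or any other Nutch property; `fetch_with_retry`, its parameters, and the `fetch` callable passed in are hypothetical names.

```python
import socket
import time


def fetch_with_retry(fetch, url, max_retries=3, delay_seconds=0.0):
    """Call fetch(url), retrying when it raises a read timeout.

    fetch is any callable supplied by the caller (hypothetical here).
    Returns the fetched content, or re-raises the last timeout once
    max_retries retries have been used up.
    """
    attempt = 0
    while True:
        try:
            return fetch(url)
        except socket.timeout:
            attempt += 1
            if attempt > max_retries:
                # Give up: let the caller see the timeout.
                raise
            # Wait before the next attempt (0 means retry immediately).
            time.sleep(delay_seconds)
```

Only timeouts are retried here; any other exception propagates immediately, so permanent failures are not retried.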

Re: Read Time out problem

2005-12-21 Thread Nguyen Ngoc Giang
-default/site.xml I'm not sure, but I think these kinds of failed URLs are also refetched another time (db.fetch.retry.max) HTH Stefan On 21.12.2005 at 09:00, Nguyen Ngoc Giang wrote: Hi folks, When I try crawling, there are many Read Timeout errors. It seems that this error

Re: How to recrawl urls

2005-12-19 Thread Nguyen Ngoc Giang
, but I would like to confirm that. I do see a variable that defines the next fetch interval, but I am not sure of it. If anyone has more information in this regard, please let me know. Thank you in advance, On 12/19/05, Nguyen Ngoc Giang [EMAIL PROTECTED] wrote: As I understand, by default

Re: Nutch Tomcat5 or.apache.jasper.JasperException

2005-12-10 Thread Nguyen Ngoc Giang
It seems that you are running Tomcat from the wrong place. Make sure that the db and segments directories are in the directory from which you start Tomcat. For example, if you put your Nutch files under ROOT and db and segments are also there, then you should change to ROOT and start Tomcat as

How to get page content given URL only?

2005-12-09 Thread Nguyen Ngoc Giang
Hi everyone, I'm writing a small program that uses Nutch as a crawler only, with no search functionality. The program should be able to return page content given a URL as input. I would like to ask how we can get the page content given only the URL, since webdb only provides a

Re: How to get page content given URL only?

2005-12-09 Thread Nguyen Ngoc Giang
Groschupf [EMAIL PROTECTED] wrote: Take a look at the cache page; it returns the content from the segment. On 09.12.2005 at 09:24, Nguyen Ngoc Giang wrote: Hi everyone, I'm writing a small program that uses Nutch as a crawler only, with no search

Plugin path in Nutch web

2005-12-08 Thread Nguyen Ngoc Giang
Hi everyone, I'm writing a JSP program to allow crawling via the web. My JSP script follows nutch.tools.CrawlTool, which tries to create the database, inject into the database, fetch, and index. I have difficulty identifying the plugins. Creating the database is fine, because it doesn't require any plugin.