Stemming in Nutch 0.7.2 issue

2006-06-29 Thread Jayant Kumar Gandhi
Hey, I need stemming in my search engine based on Nutch 0.7.2. The stemming query is being created, but I am not getting appropriate results. If I search for hotel, I get 11 results, but if I search for hotels, I get 1 result. Any thoughts? I have implemented stemming using the code in the mail

Re: Stemming in Nutch 0.7.2 issue

2006-06-29 Thread Jérôme Charron
> I need stemming in my search engine based on Nutch 0.7.2, the stemming query is being created but I am not getting appropriate results. If I search for hotel, I get 11 results, but if I search for hotels, I get 1 result.
You got one result that contains both hotel and hotels, no?

Re: Stemming in Nutch 0.7.2 issue

2006-06-29 Thread Jayant Kumar Gandhi
Yeah, that page had both hotel and hotels, but shouldn't it have been all pages that contain hotel or hotels or both? That's what stemming is supposed to do. I have 2 pages that contain 'groves' and no page containing 'grove'; I get no result when the stemmer plugin is enabled. Am I hitting the

Re: Stemming in Nutch 0.7.2 issue

2006-06-29 Thread Jérôme Charron
> yeah that page had both hotel and hotels, but shouldn't it have been all pages that contain hotel or hotels or both. thats what stemming is supposed to do.
Yes, that's what stemming is supposed to do. But take a look at your query (which I cut and pasted in my previous mail): both hotel and
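Jérôme's observation can be illustrated with a toy inverted index. This is a minimal sketch and not Nutch code: the three pages and the helper names `build_index`, `search_all`, and `search_any` are invented for illustration. It assumes the generated query required both the original term and its stem, which would explain getting a single hit.

```python
from collections import defaultdict

# Hypothetical toy corpus; page names and contents are invented.
DOCS = {
    "page1": "cheap hotel in paris",
    "page2": "list of hotels in rome",
    "page3": "this hotel beats other hotels",
}

def build_index(docs):
    """Map each whitespace-separated term to the set of pages containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

def search_all(index, terms):
    """Every term is required (AND): intersect the posting sets."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def search_any(index, terms):
    """Any term may match (OR): union the posting sets."""
    return set().union(*(index.get(t, set()) for t in terms))

index = build_index(DOCS)
both_required = search_all(index, ["hotel", "hotels"])
either_accepted = search_any(index, ["hotel", "hotels"])
```

With both terms required, only the page that contains both words survives; treating them as alternatives returns every page that mentions either form.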

Re: Stemming in Nutch 0.7.2 issue

2006-06-29 Thread Jayant Kumar Gandhi
I am using the code as given at http://www.nabble.com/RE%3A-Nutch-does-not-use-stemmers--p249520.html
On 6/29/06, Jérôme Charron [EMAIL PROTECTED] wrote:
> Yes, that's what stemming is supposed to do. But take a look at your query (that I have cut and paste in my previous mail): both hotel and

Re: Stemming in Nutch 0.7.2 issue

2006-06-29 Thread Jérôme Charron
> I am using the code as given at http://www.nabble.com/RE%3A-Nutch-does-not-use-stemmers--p249520.html
Deactivate the basic query filter and it should work.
Jérôme
--
http://motrech.free.fr/ http://www.frutch.org/

Re: Stemming in Nutch 0.7.2 issue

2006-06-29 Thread Jayant Kumar Gandhi
I tried this but there is an issue. If I search for 'Hotel', only a search for hotel gets done, and pages containing only hotels are left out. If I search for hotels, again the search happens for hotel only, and pages having just the word hotels are missing from the results. Also because of
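The symptom described here is what one would expect if the query is stemmed but the indexed terms are not. The sketch below is not Nutch code and makes that assumption explicit: `toy_stem` is a hypothetical stand-in for a real stemmer (it only lowercases and strips one trailing "s"), and the two pages are invented.

```python
def toy_stem(term):
    """Hypothetical stemmer stand-in: lowercase and strip one trailing 's'."""
    term = term.lower()
    return term[:-1] if term.endswith("s") and len(term) > 3 else term

def index_terms(docs, analyze):
    """Build term -> set of pages, passing every term through `analyze`."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(analyze(term), set()).add(doc_id)
    return index

def search(index, query):
    """The query term is always stemmed, as in the behaviour reported above."""
    return index.get(toy_stem(query), set())

# Invented pages: one has only the singular, the other only plurals.
DOCS = {"a": "Hotel with pool", "b": "hotels and groves"}

raw_index = index_terms(DOCS, str.lower)    # index built without stemming
stemmed_index = index_terms(DOCS, toy_stem)  # index built with the same stemmer

missed = search(raw_index, "hotels")         # finds only page "a"
found = search(stemmed_index, "hotels")      # finds pages "a" and "b"
no_grove = search(raw_index, "groves")       # empty: "grove" was never indexed
```

Under this assumption, the fix is to apply the same analysis at index time and at query time, so that both sides agree on the stored form of each term.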

stemming

2006-06-29 Thread Dima Mazmanov
Hi, I've gotten a couple of questions off-list about stemming, so I thought I'd just post here with my changes. Sorry that some of the changes are in the main code and not in a plugin. It seemed more efficient to put them in the main analyzer. It would be nice if later releases could add

Fetcher hanging temporarily on deflateBytes method

2006-06-29 Thread Daniel Varela Santoalla
Hello all, I've seen this mentioned on the mailing list before, but nobody has provided a solution yet (or I didn't find it). The problem is that this deflateBytes seems to hang for long periods (from minutes to more than an hour), making the whole crawling process really slow. I'm crawling a

Re: Fetcher hanging temporarily on deflateBytes method

2006-06-29 Thread Ken Krugler
Hi Daniel,
> I've seen this mentioned in the mailing list before but nobody provided a solution yet (or I didn't find it). The problem is that this deflateBytes seems to hang for long periods (from minutes to more than an hour) making the whole crawling process really slow. I'm crawling a

Re: Fetcher hanging temporarily on deflateBytes method

2006-06-29 Thread Dennis Kubes
We have seen this before too. If it is the same problem, it is the regex URL filter. Comment out the -.*(/.+?)/.*?\1/.*?\1/ expression in the regex-urlfilter.txt file and it should resolve itself. Also search the forum for "Fetcher stops pushes cpu to 100%".
Dennis
Daniel Varela Santoalla
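To see what the rule Dennis mentions is meant to catch, the same expression can be tried outside Nutch. This is a minimal Python sketch, not the Nutch filter itself: the helper name `is_excluded` and the example URLs are hypothetical, and using an unanchored search is an assumption about how the filter applies the rule. The leading `-` is dropped because in regex-urlfilter.txt it marks an exclusion rule rather than being part of the pattern.

```python
import re

# Pattern from the rule quoted above, without the leading '-' marker.
# It looks for a path segment that occurs three times, each followed by '/'.
REPEATED_SEGMENT = re.compile(r".*(/.+?)/.*?\1/.*?\1/")

def is_excluded(url):
    """Return True if the URL contains a path segment repeated three times."""
    return REPEATED_SEGMENT.search(url) is not None

looping = is_excluded("http://example.com/a/b/a/c/a/d")  # '/a/' occurs three times
normal = is_excluded("http://example.com/a/b/c/")        # no repeated segment
```

The combination of a greedy `.*`, lazy quantifiers, and backreferences can force a lot of backtracking on long URLs that almost match, which would be consistent with the fetcher appearing to hang until the rule is commented out.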

Re: Fetcher hanging temporarily on deflateBytes method

2006-06-29 Thread Daniel Varela Santoalla
Hello Ken, I don't know; during these hangs there seems to be no CPU or disk activity (the index directory stays exactly the same size). And in this case the site is on the LAN, so it should be quite fast to get even big files. Before, we had the fetch size limited to 64M, but now it is

deleting URL duplicates - never actually deleted?

2006-06-29 Thread Honda-Search Administrator
Maybe someone can explain to me how this works. First, my setup: I create a fetchlist each night with FreeFetchlistTool and fetch those pages. It often contains the same URLs that are already in the database, but this tool gets the newest copies of those URLs. I also run nutch dedup after

Re: Fetcher hanging temporarily on deflateBytes method

2006-06-29 Thread Dennis Kubes
When I was researching this issue I first thought it was the deflateBytes method as well, but when I changed things in the code the problem persisted until I changed the regex filter. Maybe your problem actually is in the deflateBytes method. The forum I was talking about earlier was

Input and Output Value Class Types

2006-06-29 Thread Dennis Kubes
All, is there a way to get around requiring the input value class and output value class to be the same? I have an ObjectWritable that I am trying to unwrap.
Dennis

Re: stemming

2006-06-29 Thread bb300
Hi Dima, thanks for your contribution. I'll try it this Sunday.
> Hi, . I've gotten a couple of questions offlist about stemming so I thought I'd just post here with my changes. Sorry that some of the changes are in the main code and not in a plugin. It seemed that it's more efficient to

Disabling hits-per-site limit

2006-06-29 Thread Ted B
I am trying to set up Nutch 0.7.2 to spider my corporate intranet site, which has only one domain name. My search results almost always hit the hits-per-site limit, causing me to see a lot of (more from my.domain) links under the hit results. This severely limits the usefulness of the hit

Re: Disabling hits-per-site limit

2006-06-29 Thread Ted B
Now I feel kind of embarrassed -- I've just found an easy solution to my problem. I found int hitsPerSite = 2; in the search.jsp file, which I've changed to 0. This is what the (more from DOMAIN) link uses.
On 6/29/06, Ted B [EMAIL PROTECTED] wrote:
> I am trying to setup Nutch 0.7.2 to spider
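The effect of that setting can be pictured as a per-host cap on the result list. The sketch below is not how Nutch implements it; the function `limit_hits_per_site` and the sample URLs are hypothetical, and treating 0 as "no limit" is an assumption based on the change Ted describes.

```python
from urllib.parse import urlparse

def limit_hits_per_site(urls, hits_per_site):
    """Keep at most `hits_per_site` results per host; 0 is assumed to disable the cap."""
    if hits_per_site <= 0:
        return list(urls)
    seen = {}
    kept = []
    for url in urls:
        host = urlparse(url).hostname
        seen[host] = seen.get(host, 0) + 1
        if seen[host] <= hits_per_site:
            kept.append(url)
    return kept

# Invented result list for an intranet served from a single host.
results = ["http://my.domain/a", "http://my.domain/b", "http://my.domain/c"]
capped = limit_hits_per_site(results, 2)    # third hit from the same host is dropped
uncapped = limit_hits_per_site(results, 0)  # all hits are kept
```

On a site with a single domain name, any positive cap hides most of the hits, which is why disabling it makes sense for an intranet.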

Re: Input and Output Value Class Types

2006-06-29 Thread Stefan Groschupf
Hi, maybe have a look at the Nutch indexer; it uses a kind of wrapper, which may help you. Also, please browse the Hadoop developer list archive, since there was some related discussion. HTH
Stefan
On 29.06.2006 at 14:41, Dennis Kubes wrote:
> All, Is there a way to get around having to

Re: Input and Output Value Class Types

2006-06-29 Thread Dennis Kubes
The indexer uses an ObjectWritable and I am using that trick. The problem is that I need to input an ObjectWritable but output a different object. I will take a look at the Hadoop list.
Dennis
Stefan Groschupf wrote:
> Hi, may be have a look to the nutch indexer it use a kind of wrapper, may be

Re: Input and Output Value Class Types

2006-06-29 Thread Stefan Groschupf
In the worst case (I do this sometimes), you have to split your task into several different jobs. Ugly, but it works. In general the problem is known; however, if you put it on the table again on the Hadoop developer list, it may get some more priority.
Stefan
On 29.06.2006, at 21:09, Dennis