Hey,
I need stemming in my search engine based on Nutch 0.7.2. The stemmed
query is being created, but I am not getting appropriate results.
If I search for hotel, I get 11 results, but if I search for hotels, I
get 1 result.
Any thoughts?
I have implemented stemming using the code in the mail
You got one result that contains both hotel and hotels ... no?
Yeah, that page had both hotel and hotels, but shouldn't the result have been
all pages that contain hotel or hotels or both? That's what stemming is
supposed to do.
I have 2 pages that contain 'groves' and no page containing 'grove', and I
get no results when the stemmer plugin is enabled.
Am I hitting the
Yes, that's what stemming is supposed to do.
But take a look at your query (that I have cut and paste in my previous
mail): both hotel and
I am using the code as given at
http://www.nabble.com/RE%3A-Nutch-does-not-use-stemmers--p249520.html
On 6/29/06, Jérôme Charron wrote:
Deactivate the basic query filter and it should work.
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
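A small sketch of why the hotel/hotels searches above can disagree: stemming only helps when the same stemmer is applied both at index time and at query time. The class, the `stem` helper and the sample pages below are hypothetical and for illustration only; this is a naive suffix stripper, not the Porter stemmer and not the Nutch or Lucene analyzer API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy illustration only: not Nutch code and not a real stemming algorithm.
class StemmingSketch {

    // Hypothetical stemmer: lowercases and strips a single trailing "s".
    static String stem(String term) {
        String t = term.toLowerCase(Locale.ROOT);
        return (t.length() > 3 && t.endsWith("s")) ? t.substring(0, t.length() - 1) : t;
    }

    // Build an inverted index (term -> page ids), optionally stemming the terms.
    static Map<String, Set<Integer>> index(List<String> pages, boolean stemTerms) {
        Map<String, Set<Integer>> idx = new HashMap<>();
        for (int id = 0; id < pages.size(); id++) {
            for (String word : pages.get(id).split("\\s+")) {
                if (word.isEmpty()) continue;
                String key = stemTerms ? stem(word) : word.toLowerCase(Locale.ROOT);
                idx.computeIfAbsent(key, k -> new TreeSet<>()).add(id);
            }
        }
        return idx;
    }

    // The query term is always stemmed before the lookup.
    static Set<Integer> search(Map<String, Set<Integer>> idx, String query) {
        return idx.getOrDefault(stem(query), Collections.emptySet());
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("cheap hotel", "luxury hotels", "hotel and hotels");
        // Query stemmed, index not stemmed: the page with only "hotels" is missed.
        System.out.println(search(index(pages, false), "hotels")); // [0, 2]
        // Query and index both stemmed: all three pages match.
        System.out.println(search(index(pages, true), "hotels"));  // [0, 1, 2]
    }
}
```

With an unstemmed index, a stemmed query for 'groves' looks up 'grove' and finds nothing, which is consistent with the empty result reported earlier in the thread. Whether that is the actual cause in a given Nutch setup is an assumption.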
I tried this, but there is an issue.
If I search for 'Hotel', only the search for hotel gets done, and pages
containing only hotels are left out. If I search for
hotels, again the search happens for hotel only, and pages having
just the word hotels are missing from the results.
Also because of
Hi,
I've gotten a couple of questions off-list about stemming,
so I thought I'd just post my changes here. Sorry that
some of the changes are in the main code and not in a plugin. It
seemed more efficient to put them in the main analyzer. It
would be nice if later releases could add
Hello all
I've seen this mentioned in the mailing list before but nobody provided
a solution yet (or I didn't find it).
The problem is that this deflateBytes seems to hang for long periods
(from minutes to more than an hour) making the whole crawling process
really slow. I'm crawling a
Hi Daniel,
We have seen this before too. If it is the same problem, it is the regex
URL filter. Comment out the -.*(/.+?)/.*?\1/.*?\1/ expression in the
regex-urlfilter.txt file and it should resolve itself. Also search the
forum for "Fetcher stops pushes cpu to 100%".
Dennis
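To show what the expression above is looking for, here is a minimal sketch that runs it against two made-up URLs. The class name, method name and sample URLs are hypothetical, and the use of find() is an assumption about how the filter applies its rules. The leading '-' in the filter file marks an exclude rule and is not part of the regex itself.

```java
import java.util.regex.Pattern;

// Illustration only: exercises the pattern outside of Nutch.
class RepeatedSegmentSketch {

    // The expression from regex-urlfilter.txt without its leading '-'.
    static final Pattern REPEATED = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

    // True when some slash-delimited segment shows up three times in the URL.
    static boolean looksLikeLoop(String url) {
        return REPEATED.matcher(url).find();
    }

    public static void main(String[] args) {
        // "/foo" appears three times, so the rule would exclude this URL.
        System.out.println(looksLikeLoop("http://example.com/foo/bar/foo/baz/foo/")); // true
        // No segment repeats three times, so the rule does not apply.
        System.out.println(looksLikeLoop("http://example.com/foo/bar/baz/"));         // false
    }
}
```

The backreferences combined with lazy quantifiers mean the regex engine may backtrack heavily on long URLs that do not match, which would fit the CPU symptom mentioned above, though that link is not confirmed here.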
Daniel Varela Santoalla
Hello Ken
I don't know, during these hangs there seems to be no CPU or disk
activity (the index directory keeps exactly the same size). And in this
case the site is in the LAN, so it should be quite fast to get even big
files. Before we had the fetch size limited to 64M, but now it is
Maybe someone can explain to me how this works.
First, my setup.
I create a fetchlist each night with FreeFetchlistTool and fetch those
pages. It often contains the same URLs that are already in the database,
but this tool gets the newest copies of those URLs.
I also run nutch dedup after
When I was researching this issue, I first thought it was the
deflateBytes method as well, but when I changed things in the code the
problem persisted until I changed the regex filter. Maybe your problem
actually is in the deflateBytes method. The forum I was talking about
earlier was
All,
Is there a way to get around having to have the input value class and
output value class be the same? I have an object writable that I am
trying to unwrap.
Dennis
Hi, Dima
Thanks for your contribution. I'll try it this Sunday.
I am trying to set up Nutch 0.7.2 to spider my corporate intranet site, which
has only one domain name. My search results almost always hit the
hits-per-site limit, causing me to see a lot of (more from my.domain)
links under the hit results. This severely limits the usefulness of the hit
Now I feel kind of embarrassed -- I've just found an easy solution to my
problem. I found int hitsPerSite = 2; in the search.jsp file, which I've
changed to 0. This is what the (more from DOMAIN) link uses.
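As a rough illustration of what a hits-per-site limit does to a result list, here is a small sketch. It is not the Nutch implementation; the class, the method and the URLs are hypothetical, and treating 0 as "no limit" is an assumption based on the effect described above.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a per-host cap on search results.
class HitsPerSiteSketch {

    // Keep at most hitsPerSite results per host; 0 (or less) disables the cap.
    static List<String> limit(List<String> urls, int hitsPerSite) {
        if (hitsPerSite <= 0) {
            return new ArrayList<>(urls);
        }
        Map<String, Integer> seenPerHost = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String host = String.valueOf(URI.create(url).getHost());
            int count = seenPerHost.merge(host, 1, Integer::sum);
            if (count <= hitsPerSite) {
                kept.add(url); // further hits from this host are dropped once the cap is reached
            }
        }
        return kept;
    }
}
```

On a single-domain intranet every result shares one host, so a cap of 2 hides all but two hits per query, while 0 returns the full list.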
On 6/29/06, Ted B wrote:
I am trying to setup Nutch 0.7.2 to spider
Hi,
Maybe have a look at the Nutch indexer; it uses a kind of wrapper, and
maybe this can help you.
Also, please browse the Hadoop developer list archive, since there was
some related discussion.
HTH
Stefan
On 29.06.2006 at 14:41, Dennis Kubes wrote:
All,
Is there a way to get around having to
The indexer uses an ObjectWritable, and I am using that trick. The problem
is that I need to input an ObjectWritable but output a different object. I
will take a look at the Hadoop list.
Dennis
Stefan Groschupf wrote:
Hi,
Maybe have a look at the Nutch indexer; it uses a kind of wrapper, may
be
In the worst case (I do this sometimes), you have to split your task into
several different jobs.
Ugly, but it works.
In general the problem is known; however, if you put it on the table
again on the Hadoop developer list, it may get some more priority.
Stefan
On 29.06.2006, at 21:09, Dennis