Thanks all,
but how does Nutch handle this problem? I am aware of Nutch but not in
depth. If I search the keyword "about us", Nutch gives me exactly what I
want. Are there any scoring techniques? Please let me know.
(sorry, tangent. I'll be quick)
On Tue, Aug 4, 2009 at 8:42 AM, Shai Erera wrote:
> Interesting ... I don't have access to a Japanese dictionary, so I just
> extract bi-grams.
Shai - if you're interested in parsing Japanese, check out Kakasi. It
can split into words and convert Kanji->Katakana/Hiragana.
may hurt recall severely.
Shai
On Tue, Aug 4, 2009 at 7:34 PM, N Hira wrote:
>
> Good summary, Shai.
>
> I've missed some of this thread as well, but does anyone know what happened
> to the suggestion about query manipulation?
>
> e.g., query (about us) => query("aboutus"), query (credit card) => query("creditcard")
Regards,
-h
- Original Message
From: Shai Erera
To: java-user@lucene.apache.org
Sent: Tuesday, August 4, 2009 10:31:46 AM
Subject: Re: Searching doubt
Hi Darren,
The question was, how given a string "aboutus" in a document, you can return
that document as a result to the query "about us" (note the space).
Well.. search on both anyhow.
"about us" OR "aboutus" should hit the spot I think.
Matt
Ian Lea wrote:
The question was, how given a string "aboutus" in a document, you can return
that document as a result to the query "about us" (note the space). So we're
mostly discussing how to detect and then break the word "aboutus" to two words.
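(Not from the thread: a minimal sketch of Matt's "search on both" suggestion above, against the Lucene 2.x/3.x query API used elsewhere in the thread. The field name "content" and the class name are placeholders.)

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class AboutUsQuery {
    // Builds: phrase "about us" OR term "aboutus", so documents indexed
    // either way are returned. "content" is a made-up field name.
    public static Query build() {
        BooleanQuery query = new BooleanQuery();
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("content", "about"));
        phrase.add(new Term("content", "us"));
        query.add(phrase, BooleanClause.Occur.SHOULD);
        query.add(new TermQuery(new Term("content", "aboutus")),
                  BooleanClause.Occur.SHOULD);
        return query;
    }
}

This attacks the problem at query time; the TokenFilter ideas later in the thread attack it at index time instead.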
> The question was, how given a string "aboutus" in a document, you can return
> that document as a result to the query "about us" (note the space). So we're
> mostly discussing how to detect and then break the word "aboutus" to two
> words.
I haven't really been following this thread so apologies
Interesting ... I don't have access to a Japanese dictionary, so I just
extract bi-grams. But I guess that in this case, if one can access an
English dictionary (are you aware of an "open-source" one, or free one
BTW?), one can use the method you mention.
But still, doing this for every Token you
On Tue, Aug 4, 2009 at 8:31 AM, Shai Erera wrote:
> Hi Darren,
>
> The question was, how given a string "aboutus" in a document, you can return
> that document as a result to the query "about us" (note the space). So we're
> mostly discussing how to detect and then break the word "aboutus" to two
> words.
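(Aside, not from the thread: for readers unfamiliar with the bi-gram fallback Shai mentions above, a toy character bi-gram extractor in plain Java could look like the following; the class and method names are made up.)

import java.util.ArrayList;
import java.util.List;

public class Bigrams {
    // Returns every pair of adjacent characters, e.g. "aboutus" ->
    // [ab, bo, ou, ut, tu, us]. Real CJK handling would normally go
    // through an analyzer rather than hand-rolled code like this.
    public static List<String> of(String text) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            grams.add(text.substring(i, i + 2));
        }
        return grams;
    }
}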
Ah, ok. Interesting problem there as well.
I'll think on that one some too!
cheers.
> Hi Darren,
>
> The question was, how given a string "aboutus" in a document, you can
> return
> that document as a result to the query "about us" (note the space). So
> we're
> mostly discussing how to detect and then break the word "aboutus" to two
> words.
Hi Darren,
The question was, how given a string "aboutus" in a document, you can return
that document as a result to the query "about us" (note the space). So we're
mostly discussing how to detect and then break the word "aboutus" to two
words.
What you wrote though seems interesting as well, onl
Just catching this thread, but if I understand what is being asked I can
share how I do multi-word phrase matching. If that's not what's wanted,
pardons!
Ok, I load an entire dictionary into a lucene index, phrases and all.
When I'm scanning some text, I do lookups in this dictionary index using
On Tue, Aug 4, 2009 at 3:56 AM, Shai Erera wrote:
> 2) Use a dictionary (real dictionary), and search it for every substring,
> e.g. "a", "ab", "abo" ... "about" etc. If you find a match, split it there.
> This needs some fine tuning, like checking if the rest is also a word and if
> the full strin
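(Not from the thread: the prefix-scanning idea quoted above, option 2, might be sketched roughly like this in plain Java. The recursive strategy and class name are my own simplifications and skip the fine tuning Shai mentions.)

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class DictionarySplitter {

    private final Set<String> dictionary;

    public DictionarySplitter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    // Returns the parts if the whole token splits into dictionary words
    // (e.g. "aboutus" -> [about, us] given a dictionary containing
    // "about" and "us"), or null if no full split exists.
    public List<String> split(String token) {
        if (token.length() == 0) {
            return new ArrayList<String>();
        }
        for (int i = 1; i <= token.length(); i++) {
            String prefix = token.substring(0, i);   // "a", "ab", "abo", ...
            if (dictionary.contains(prefix)) {
                List<String> rest = split(token.substring(i));
                if (rest != null) {                  // remainder is also all words
                    rest.add(0, prefix);
                    return rest;
                }
            }
        }
        return null;
    }
}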
If you don't know which tokens you'll face, then it's really a much harder
problem. If you know where the token is, e.g. it's always in
http://some.example.site/a/b//index.html,
then it eases the task a bit. Otherwise you'll need to search every single
token produced. I can think of several ways to
Thanks,
I've noticed that, but the code is for known tokens. How do I
do it for dynamic tokens, meaning I don't know the URLs; someone picks
the URLs and I'll index them. Is there any technique to use while indexing?
I am using Lucene 2.4.0. Please suggest.
Well, if you have more cases like "aboutus", then I think the TokenFilter
approach will help you. You should create your own Analyzer which receives
another Analyzer as an argument, and implement its tokenStream() like this (it's
the general idea):
public TokenStream tokenStream(String fld, Reader reader
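(The snippet above is cut off in the archive. A bare-bones sketch of the wrapping pattern, against the Lucene 2.4-era API mentioned later in the thread, might look like this; "WrappingAnalyzer" is a made-up name, and LowerCaseFilter is only a stand-in for the custom splitting filter.)

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

public class WrappingAnalyzer extends Analyzer {

    private final Analyzer delegate;

    public WrappingAnalyzer(Analyzer delegate) {
        this.delegate = delegate;
    }

    // Delegate the real tokenization, then pipe the result through an
    // extra filter. Swap LowerCaseFilter for the filter that splits
    // "aboutus" into "about" + "us".
    public TokenStream tokenStream(String fld, Reader reader) {
        return new LowerCaseFilter(delegate.tokenStream(fld, reader));
    }
}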
Thanks for your reply,
my original code snippet is
IndexSearcher searcher = new IndexSearcher(indexDir);
Analyzer analyzer = new StopAnalyzer();
BooleanClause.Occur[] flags = { BooleanClause.Occur.SHOULD,
Boolea
I don't see that you use the Analyzer anywhere (i.e., it's created but not
used?).
Also, the wildcard query you create may be very inefficient, as it will
expand all the terms under the DEFAULT_FIELD. If the DEFAULT_FIELD is the
field where all your "default searchable" terms are indexed, there coul
Thanks
This is my code snippet:
IndexSearcher searcher = new IndexSearcher(indexDir);
Analyzer analyzer = new StopAnalyzer();
WildcardQuery query = new WildcardQuery(new
Term(DEFAULT_FIELD));
searcher.search(
I can think of another approach - during indexing, capture the word
"aboutus" and index it as "about us" and "aboutus" in the same position.
That way both queries will work. You'd need to write your own TokenFilter,
maybe a SynonymTokenFilter (since this reminds me of "synonyms" usage) that
accept
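(Nothing like this appears verbatim in the thread, but a rough sketch of such a filter, written against the newer attribute-based TokenStream API rather than the 2.4.0 API used elsewhere in the thread, could look like the following. The class name and the splits map are my own.)

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public final class CompoundSplitFilter extends TokenFilter {

    private final Map<String, String[]> splits;  // e.g. "aboutus" -> {"about", "us"}
    private final Deque<String> pending = new ArrayDeque<String>();

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt =
        addAttribute(PositionIncrementAttribute.class);

    private AttributeSource.State savedState;
    private boolean firstPart;

    public CompoundSplitFilter(TokenStream input, Map<String, String[]> splits) {
        super(input);
        this.splits = splits;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
            // Emit the next split part. The first part shares the position of
            // the compound token (increment 0); later parts advance by one so
            // the phrase query "about us" matches.
            restoreState(savedState);
            termAtt.setEmpty().append(pending.removeFirst());
            posIncAtt.setPositionIncrement(firstPart ? 0 : 1);
            firstPart = false;
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String[] parts = splits.get(termAtt.toString());
        if (parts != null) {
            savedState = captureState();
            for (String p : parts) {
                pending.addLast(p);
            }
            firstPart = true;
        }
        return true;  // the original compound token is emitted first
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending.clear();
        savedState = null;
    }
}

Used at index time only, with a map like {"aboutus" -> ["about", "us"]}, documents containing "aboutus" would then match both the term query "aboutus" and the phrase query "about us".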
Hi Harig,
What you are trying to do is search for 2 tokens as one. You'd have to index
the URL as you want for the token to be searchable. Else you might try a
wildcard query.
--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com
The facts expressed here belong to everybody, the opinions to m