Re: Searching doubt

2009-08-04 Thread m.harig
Thanks all, but how does Nutch handle this problem? I am aware of Nutch, but not in depth. If I search for the keyword "about us", Nutch gives me exactly what I want. Are there any scoring techniques? Please let me know. -- View this message in context: http://www.nabble.com/Searching-doubt-tp2

Re: Searching doubt

2009-08-04 Thread Phil Whelan
(sorry, tangent. I'll be quick) On Tue, Aug 4, 2009 at 8:42 AM, Shai Erera wrote: > Interesting ... I don't have access to a Japanese dictionary, so I just > extract bi-grams. Shai - if you're interested in parsing Japanese, check out Kakasi. It can split into words and convert Kanji->Katakana/Hi

Re: Searching doubt

2009-08-04 Thread Shai Erera
may hurt recall severely. Shai On Tue, Aug 4, 2009 at 7:34 PM, N Hira wrote: > > Good summary, Shai. > > I've missed some of this thread as well, but does anyone know what happened > to the suggestion about query manipulation? > > e.g., query (about us) => query("abo

Re: Searching doubt

2009-08-04 Thread N Hira
"creditcard") Regards, -h - Original Message From: Shai Erera To: java-user@lucene.apache.org Sent: Tuesday, August 4, 2009 10:31:46 AM Subject: Re: Searching doubt Hi Darren, The question was, how given a string "aboutus" in a document, you can return that document a

Re: Searching doubt

2009-08-04 Thread Matthew Hall
Well.. search on both anyhow. "about us" OR "aboutus" should hit the spot I think. Matt Ian Lea wrote: The question was, how given a string "aboutus" in a document, you can return that document as a result to the query "about us" (note the space). So we're mostly discussing how to detect and t
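The "search on both" idea can be sketched in plain Java, without any Lucene dependency. The class and method names below (QueryExpander, expand) are hypothetical and only illustrate the string manipulation; the output uses the usual Lucene query syntax of a quoted phrase combined with OR, and the sketch does not escape query-syntax special characters.

```java
// Hypothetical helper, not part of Lucene: rewrites a multi-word user query so
// that it matches either the phrase or its run-together form.
final class QueryExpander {

    private QueryExpander() {
    }

    // "about us" becomes: "about us" OR aboutus
    // Single-word (or empty) queries are returned trimmed and otherwise unchanged.
    static String expand(String userQuery) {
        String trimmed = userQuery.trim();
        String[] words = trimmed.split("\\s+");
        if (words.length < 2) {
            return trimmed;
        }
        String phrase = String.join(" ", words); // normalised phrase form
        String joined = String.join("", words);  // run-together form
        return "\"" + phrase + "\" OR " + joined;
    }
}
```

The resulting string could then be handed to whatever query parser the application already uses. Note that this only helps when the run-together form was indexed as a single term.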

Re: Searching doubt

2009-08-04 Thread Ian Lea
> The question was, how given a string "aboutus" in a document, you can return > that document as a result to the query "about us" (note the space). So we're > mostly discussing how to detect and then break the word "aboutus" to two > words. I haven't really been following this thread so apologies

Re: Searching doubt

2009-08-04 Thread Shai Erera
Interesting ... I don't have access to a Japanese dictionary, so I just extract bi-grams. But I guess that in this case, if one can access an English dictionary (are you aware of an "open-source" one, or free one BTW?), one can use the method you mention. But still, doing this for every Token you
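For readers unfamiliar with the bi-gram fallback mentioned here, a minimal plain-Java sketch follows. BigramExtractor is a hypothetical name, and the code assumes the input has no characters outside the Basic Multilingual Plane (it slices by Java char, not by code point).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: produces overlapping two-character grams from a string,
// the dictionary-free fallback for text without word boundaries.
final class BigramExtractor {

    private BigramExtractor() {
    }

    // Returns every pair of adjacent chars; inputs shorter than two chars
    // yield an empty list.
    static List<String> bigrams(String text) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            grams.add(text.substring(i, i + 2));
        }
        return grams;
    }
}
```

Bi-grams avoid the need for a dictionary but produce many more terms, which is the trade-off against the dictionary-based split discussed in this thread.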

Re: Searching doubt

2009-08-04 Thread Phil Whelan
On Tue, Aug 4, 2009 at 8:31 AM, Shai Erera wrote: > Hi Darren, > > The question was, how given a string "aboutus" in a document, you can return > that document as a result to the query "about us" (note the space). So we're > mostly discussing how to detect and then break the word "aboutus" to two >

Re: Searching doubt

2009-08-04 Thread darren
Ah, ok. Interesting problem there as well. I'll think on that one some too! cheers. > Hi Darren, > > The question was, how given a string "aboutus" in a document, you can > return > that document as a result to the query "about us" (note the space). So > we're > mostly discussing how to detec

Re: Searching doubt

2009-08-04 Thread Shai Erera
Hi Darren, The question was, how given a string "aboutus" in a document, you can return that document as a result to the query "about us" (note the space). So we're mostly discussing how to detect and then break the word "aboutus" to two words. What you wrote though seems interesting as well, onl

Re: Searching doubt

2009-08-04 Thread darren
Just catching this thread, but if I understand what is being asked I can share how I do multi-word phrase matching. If that's not what's wanted, pardons! Ok, I load an entire dictionary into a lucene index, phrases and all. When I'm scanning some text, I do lookups in this dictionary index using

Re: Searching doubt

2009-08-04 Thread Phil Whelan
On Tue, Aug 4, 2009 at 3:56 AM, Shai Erera wrote: > 2) Use a dictionary (real dictionary), and search it for every substring, > e.g. "a", "ab", "abo" ... "about" etc. If you find a match, split it there. > This needs some fine tuning, like checking if the rest is also a word and if > the full strin
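The dictionary approach quoted above (try each prefix "a", "ab", "abo", ... and split where a word is found) might look roughly like this in plain Java. DictionarySplitter is a hypothetical class, the dictionary is assumed to fit in an in-memory Set, and the "fine tuning" is reduced to two checks: the remainder must also be a word, and a token that is itself a word is never split.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not part of Lucene: splits a run-together token such as
// "aboutus" into two dictionary words.
final class DictionarySplitter {
    private final Set<String> dictionary;

    DictionarySplitter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    // Tries every prefix in increasing length and accepts the first split where
    // both the prefix and the remainder are dictionary words. Returns empty if
    // the token is already a word or no such split exists.
    Optional<List<String>> split(String token) {
        if (dictionary.contains(token)) {
            return Optional.empty();
        }
        for (int i = 1; i < token.length(); i++) {
            String head = token.substring(0, i);
            String tail = token.substring(i);
            if (dictionary.contains(head) && dictionary.contains(tail)) {
                return Optional.of(Arrays.asList(head, tail));
            }
        }
        return Optional.empty();
    }
}
```

Running this for every token at index time costs one set lookup per prefix, so it is probably best restricted to tokens that are not already known words, as the sketch does.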

Re: Searching doubt

2009-08-04 Thread Shai Erera
If you don't know which tokens you'll face, then it's really a much harder problem. If you know where the token is, e.g. it's always in http://some.example.site/a/b//index.html, then it eases the task a bit. Otherwise you'll need to search every single token produced. I can think of several ways to

Re: Searching doubt

2009-08-04 Thread m.harig
Thanks, I've noticed that, but the code is for known tokens. How do I do it for dynamic tokens? That is, I don't know the URLs; someone else picks the URLs and I index them. Is there any technique to use while indexing? I am using Lucene 2.4.0. Please advise. -- Vie

Re: Searching doubt

2009-08-04 Thread Shai Erera
Well, if you have more cases like "aboutus", then I think the TokenFilter approach will help you. You should create your own Analyzer which receives another Analyzer as an argument, and implement its tokenStream() like this (this is the general idea): public TokenStream tokenStream(String fld, Reader reader

Re: Searching doubt

2009-08-04 Thread m.harig
Thanks for your reply, my original code snippet is IndexSearcher searcher = new IndexSearcher(indexDir); Analyzer analyzer = new StopAnalyzer(); BooleanClause.Occur[] flags = { BooleanClause.Occur.SHOULD, Boolea

Re: Searching doubt

2009-08-03 Thread Shai Erera
I don't see that you use the Analyzer anywhere (i.e. it's created but not used?). Also, the wildcard query you create may be very inefficient, as it will expand all the terms under the DEFAULT_FIELD. If the DEFAULT_FIELD is the field where all your "default searchable" terms are indexed, there coul

Re: Searching doubt

2009-08-03 Thread m.harig
Thanks. This is my code snippet: IndexSearcher searcher = new IndexSearcher(indexDir); Analyzer analyzer = new StopAnalyzer(); WildcardQuery query = new WildcardQuery(new Term(DEFAULT_FIELD)); searcher.search(

Re: Searching doubt

2009-08-03 Thread Shai Erera
I can think of another approach - during indexing, capture the word "aboutus" and index it as "about us" and "aboutus" in the same position. That way both queries will work. You'd need to write your own TokenFilter, maybe a SynonymTokenFilter (since this reminds me of "synonyms" usage) that accept
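The effect of such a filter can be illustrated without the Lucene API. In the plain-Java sketch below, SamePositionInjector and Tok are hypothetical names, and the table of known splits is assumed to be supplied by the caller. A position increment of 0 means "same position as the previous token", which mirrors how Lucene tokens carry a position increment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a synonym-style token filter: emits the original
// token and, when a split is known, the split words starting at the same position.
final class SamePositionInjector {

    // A term plus its position increment (0 = same position as the previous token).
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    private SamePositionInjector() {
    }

    static List<Tok> inject(List<String> tokens, Map<String, List<String>> splits) {
        List<Tok> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(new Tok(token, 1)); // original token advances the position
            List<String> parts = splits.get(token);
            if (parts != null) {
                for (int i = 0; i < parts.size(); i++) {
                    // First part shares the original token's position; later parts follow it.
                    out.add(new Tok(parts.get(i), i == 0 ? 0 : 1));
                }
            }
        }
        return out;
    }
}
```

With "aboutus" mapped to "about" and "us", both the single-term query and the phrase query can match. One side effect of this simple scheme is that tokens after the injected words are shifted by one position relative to "aboutus", which can matter for phrase queries that span the split.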

Re: Searching doubt

2009-08-03 Thread Anshum
Hi Harig, What you are trying to do is search for 2 tokens as one. You'd have to index the URL the way you want the token to be searchable. Otherwise, you might try a wildcard query. -- Anshum Gupta Naukri Labs! http://ai-cafe.blogspot.com The facts expressed here belong to everybody, the opinions to m