Re: Regex Help

2009-12-15 Thread Weiwei Wang
Same reason: I do not know which delimiter regex to use :-( On Wed, Dec 16, 2009 at 3:02 PM, Ghazal Gharooni wrote: > Hello, > Why don't you use StringTokenizer for splitting the result? > > > On Tue, Dec 15, 2009 at 9:45 PM, Weiwei Wang wrote: > > I want to split this parsed result string: name

Re: Regex Help

2009-12-15 Thread Ghazal Gharooni
Hello, Why don't you use StringTokenizer for splitting the result? On Tue, Dec 15, 2009 at 9:45 PM, Weiwei Wang wrote: > I want to split this parsed result string: name:"zhong guo" name:friend > server:172.16.65.79 > > into > > name:"zhong guo" > name:friend > server:172.16.65.79 > > how can I
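The limitation Weiwei ran into can be seen with a quick sketch (plain JDK, no Lucene, names illustrative): a whitespace-based StringTokenizer breaks the quoted value at its internal space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerDemo {
    // Collect whitespace-delimited tokens the way StringTokenizer sees them.
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(input);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // The quoted value "zhong guo" is split at its space, yielding
        // four tokens instead of the desired three fields.
        for (String t : tokens("name:\"zhong guo\" name:friend server:172.16.65.79")) {
            System.out.println(t);
        }
    }
}
```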

Regex Help

2009-12-15 Thread Weiwei Wang
I want to split this parsed result string: name:"zhong guo" name:friend server:172.16.65.79 into name:"zhong guo" name:friend server:172.16.65.79 how can I write a regex pattern to do that? I'm not familiar with regex and tried a few patterns which didn't work -- Weiwei Wang Alex Wang 王巍巍 R
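One way to approach this (a sketch, not from the thread; class and method names are illustrative): rather than splitting on a delimiter, match each key:value pair, letting the value be either a quoted string or an unquoted run of non-space characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitDemo {
    // Each field is key:value, where value is either a quoted string
    // (which may contain spaces) or an unquoted run of non-space chars.
    private static final Pattern FIELD = Pattern.compile("\\w+:(?:\"[^\"]*\"|\\S+)");

    static List<String> split(String input) {
        List<String> parts = new ArrayList<String>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String part : split("name:\"zhong guo\" name:friend server:172.16.65.79")) {
            System.out.println(part);
        }
    }
}
```

Matching the pairs instead of splitting on whitespace sidesteps the problem of spaces inside quoted values; the three fields come out on separate lines.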

Re: How to do alias(Pinyin) search in Lucene

2009-12-15 Thread Weiwei Wang
Thanks Robert, I've learned a lot from you :-) On Wed, Dec 16, 2009 at 11:53 AM, Robert Muir wrote: > Hi, just one more thought for you. > > I think even more important than anything I said before, you should ensure > you implement reusableTokenStream in your analyzer. > this becomes a necessity if

Re: How to do alias(Pinyin) search in Lucene

2009-12-15 Thread Robert Muir
Hi, just one more thought for you. I think even more important than anything I said before, you should ensure you implement reusableTokenStream in your analyzer. this becomes a necessity if you are using expensive objects like this. 2009/12/15 Weiwei Wang > Finally, i make it run, however, it w

Re: How to do alias(Pinyin) search in Lucene

2009-12-15 Thread Robert Muir
Hello, this was mainly to show you a quick-and-dirty way to solve the problem. If you have a lot of text, here are some ways to optimize: 1. the 'cleanup' step I showed you is an extremely inefficient way to remove the space and diacritics. For your case perhaps you can use more efficient ways to av
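For the diacritics part of the cleanup, one lighter-weight option (a JDK-only sketch, not the ICU approach from the patch; names are illustrative) is Unicode decomposition followed by stripping the combining marks:

```java
import java.text.Normalizer;

public class FoldDemo {
    // Decompose accented characters (NFD), then drop the combining marks,
    // so pinyin with tone marks folds to plain ASCII letters.
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("zhōng guó"));  // zhong guo
    }
}
```

This only handles diacritic removal; the Han-to-Latin transliteration itself still needs the ICU Transliterator discussed in the thread.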

Re: How to do alias(Pinyin) search in Lucene

2009-12-15 Thread Weiwei Wang
Finally, I made it run; however, it runs very slowly 2009/12/15 Weiwei Wang > got it, thanks, Robert > > > On Tue, Dec 15, 2009 at 10:19 PM, Robert Muir wrote: > >> if you have lucene 2.9 or 3.0 source code, just run patch -p0 < >> /path/to/LUCENE-XXYY.patch from the lucene source code root direc

Re: Document category identification in query

2009-12-15 Thread fei liu
Query classification is an interesting question and there are many papers discussing it. For more information, you could refer to these papers: "A taxonomy of web search", "Understanding user goals in web search", "Our winning solution to query classification in KDDCUP 2005". In your question, I think

Re: Document category identification in query

2009-12-15 Thread Weiwei Wang
I think you can do this with search-suggestion-like algorithms. First, you should categorize the search log; e.g. Thai Restaurant, Chinese Restaurant, or KFC should be assigned categories including Restaurant. When the user is typing, figure out from the search log which keyword is nearest to the in

Re: Document category identification in query

2009-12-15 Thread Alex
Can anybody help me or maybe point me to relevant resources I could learn from? Thanks.

Reply: Re: Tokenized fields in Lucene 3.0.0

2009-12-15 Thread 王巍巍
Check for a field typo first - Original Message - From: Michel Nadeau Sent: Wednesday, December 16, 2009, 4:48 To: java-user@lucene.apache.org Subject: Re: Tokenized fields in Lucene 3.0.0 I search like this - IndexReader reader = IndexReader.open(idx, true); IndexSearcher searcher = new IndexSearcher(reader); Que

Re: Tokenized fields in Lucene 3.0.0

2009-12-15 Thread Erick Erickson
Thanks for bringing closure, this was scaring me ... On Tue, Dec 15, 2009 at 4:31 PM, Michel Nadeau wrote: > Forget it - I found the problem. There was an escaping problem on the > search-client side. > > Sorry about that. > > - Mike > aka...@gmail.com > > > On Tue, Dec 15, 2009 at 3:48 PM, Mich

Re: Tokenized fields in Lucene 3.0.0

2009-12-15 Thread Michel Nadeau
Forget it - I found the problem. There was an escaping problem on the search-client side. Sorry about that. - Mike aka...@gmail.com On Tue, Dec 15, 2009 at 3:48 PM, Michel Nadeau wrote: > I search like this - > > IndexReader reader = IndexReader.open(idx, true); > IndexSearcher searcher =

Re: Tokenized fields in Lucene 3.0.0

2009-12-15 Thread Michel Nadeau
I search like this - IndexReader reader = IndexReader.open(idx, true); IndexSearcher searcher = new IndexSearcher(reader); QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "content", cluStdAn); // StandardAnalyzer q = parser.parse(QUERY); TopDocs td = searcher.search(q, cluCF,

Re: Tokenized fields in Lucene 3.0.0

2009-12-15 Thread Mark Miller
Any more info to share? In 2.9, Tokenized literally == Analyzed. /** @deprecated this has been renamed to {...@link #ANALYZED} */ public static final Index TOKENIZED = ANALYZED; Michel Nadeau wrote: > Hi, > > I just realized that since I upgraded from Lucene 2.x to 3.0.0 (and removed > a

Tokenized fields in Lucene 3.0.0

2009-12-15 Thread Michel Nadeau
Hi, I just realized that since I upgraded from Lucene 2.x to 3.0.0 (and removed all deprecated things), searches like that don't work anymore: test AND blue test NOT blue (test AND blue) OR red etc. Before 3.0.0, I was inserting my fields like this: doc.add(new Field("content", sValues[j], Fiel

Re: matching products with suggest feature

2009-12-15 Thread eddiec
I haven't used Lucene or read the Lucene book in quite a while, since I handed in my university thesis quite a few years ago. However, I'm currently building an ecommerce site from an ASP skeleton; the current search and recommendation algorithms are built on limited SQL searches, but I'd like to

Re: How to do alias(Pinyin) search in Lucene

2009-12-15 Thread Weiwei Wang
Got it, thanks, Robert On Tue, Dec 15, 2009 at 10:19 PM, Robert Muir wrote: > if you have lucene 2.9 or 3.0 source code, just run patch -p0 < > /path/to/LUCENE-XXYY.patch from the lucene source code root directory... it > should create the necessary directory and files. > then run 'ant' , in th

Re: How to do alias(Pinyin) search in Lucene

2009-12-15 Thread Erick Erickson
If you're using an IDE, there should be an "apply patch" option somewhere. In Eclipse, you right-click on the project>>team>>apply patch. In IntelliJ, it's something like Version Control>>(subversion???)>>apply patch Or do as Robert suggests from the command line... HTH Erick On Tue, Dec 15, 2009

Re: Looking for a MappingCharFilter that accepts regular expressions

2009-12-15 Thread Koji Sekiguchi
Paul Taylor wrote: CharStream. Found it at http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/analysis/PatternReplaceFilter.java?revision=804726&view=markup, BTW why not add this to the Lucene codebase rather than the Solr codebase? Unfortunately it doesn't address my problem be

Re: How to do alias(Pinyin) search in Lucene

2009-12-15 Thread Robert Muir
If you have the Lucene 2.9 or 3.0 source code, just run patch -p0 < /path/to/LUCENE-XXYY.patch from the Lucene source code root directory... it should create the necessary directory and files. Then run 'ant'; in this case it should create a lucene-icu jar file in the build directory. the patch doesn

Re: How to do alias(Pinyin) search in Lucene

2009-12-15 Thread Weiwei Wang
Yes, I found the patch file LUCENE-1488.patch, and there's no icu directory in my downloaded contrib directory. I'm a rookie at using patch; I'm currently in the contrib dir. Could anybody tell me how to execute this patch command to generate the relevant dir and source files? On Tue, Dec 15, 2009

Re: How to do alias(Pinyin) search in Lucene

2009-12-15 Thread Robert Muir
look at the latest patch file attached to the issue, it should work with lucene 2.9 or greater (I think) 2009/12/15 Weiwei Wang > where can i find the source code? > > On Tue, Dec 15, 2009 at 9:40 PM, Robert Muir wrote: > > > there is an icu transform tokenfilter in the patch here: > > http://i

Re: How to do alias(Pinyin) search in Lucene

2009-12-15 Thread Weiwei Wang
Where can I find the source code? On Tue, Dec 15, 2009 at 9:40 PM, Robert Muir wrote: > there is an icu transform tokenfilter in the patch here: > http://issues.apache.org/jira/browse/LUCENE-1488 > >Transliterator pinyin = Transliterator.getInstance("Han-Latin"); >Tokenizer tokenizer = n

Re: How to do alias(Pinyin) search in Lucene

2009-12-15 Thread Robert Muir
there is an icu transform tokenfilter in the patch here: http://issues.apache.org/jira/browse/LUCENE-1488 Transliterator pinyin = Transliterator.getInstance("Han-Latin"); Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国")); ICUTransformFilter filter = new ICUTransformFil

How to do alias(Pinyin) search in Lucene

2009-12-15 Thread Weiwei Wang
Hi guys, I'm implementing a Lucene-based search engine for Chinese, so I want to support pinyin search as Google China does. E.g. “中国” means "China" in English; this word's pinyin input is "zhongguo". The feature I want to implement is: when the user types zhongguo, the results will include

Re: I need to implement a TokenFilter to break season07

2009-12-15 Thread Weiwei Wang
WordDelimiterFilter is implemented in an old version where nextToken is called On Tue, Dec 15, 2009 at 7:17 PM, Koji Sekiguchi wrote: > Weiwei Wang wrote: > >> Hi, all >> I currently need a TokenFilter to break token season07 into two >> tokens >> season 07 >> >> >> > I'd recommend you to r

RE: I need to implement a TokenFilter to break season07

2009-12-15 Thread Uwe Schindler
> > And if you do it yourself, don't forget to call clearAttributes() > whenever > > you produce new tokens (else you may have bugs in the token increments). > In > > the old token api its Token.clear()... Just a warning! > > This comment has worried me, is this ok or am i meant to call > clearAtt

Re: Lucene Analyzer that can handle C++ vs C#

2009-12-15 Thread Weiwei Wang
KeywordAnalyzer cannot handle a whole sentence. On Tue, Dec 15, 2009 at 7:33 PM, Ganesh wrote: > How about KeywordAnalyzer? It will treat C++ and C# as single term. > > Regards > Ganesh > > - Original Message - > From: "Chris Lu" > To: > Sent: Saturday, December 12, 2009 5:27

Re: Search correction

2009-12-15 Thread Weiwei Wang
Thanks Simon, I used an approach similar to SpellChecker's to do search suggestion. I'll try that for search correction. On Tue, Dec 15, 2009 at 7:30 PM, Simon Willnauer < simon.willna...@googlemail.com> wrote: > Weiwei, > Lucene Contrib offers a Spellchecker package that might help you with > your a

Re: I need to implement a TokenFilter to break season07

2009-12-15 Thread Paul Taylor
Uwe Schindler wrote: And if you do it yourself, don't forget to call clearAttributes() whenever you produce new tokens (else you may have bugs in the token increments). In the old token api its Token.clear()... Just a warning! This comment has worried me, is this ok or am i meant to call clear

Re: Lucene Analyzer that can handle C++ vs C#

2009-12-15 Thread Ganesh
How about KeywordAnalyzer? It will treat C++ and C# as single term. Regards Ganesh - Original Message - From: "Chris Lu" To: Sent: Saturday, December 12, 2009 5:27 AM Subject: Re: Lucene Analyzer that can handle C++ vs C# > What we did in DBSight is to provide a reserved list of wor

Re: Search correction

2009-12-15 Thread Simon Willnauer
Weiwei, Lucene Contrib offers a Spellchecker package that might help you with your application. The spellchecker takes a dictionary of terms (built from your search index or from some other text resource) and builds a suggestion index from those terms. Internally terms are indexed as ngrams. You ca

RE: SnowballAnalyzer and StopAnalyzer.ENGLISH_STOP_WORDS_SET ?

2009-12-15 Thread Uwe Schindler
Have seen it! It is easy to implement. Thanks! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Nick Burch [mailto:n...@torchbox.com] > Sent: Tuesday, December 15, 2009 12:23 PM > To: java-user@lucene.apa

RE: SnowballAnalyzer and StopAnalyzer.ENGLISH_STOP_WORDS_SET ?

2009-12-15 Thread Nick Burch
On Mon, 14 Dec 2009, Uwe Schindler wrote: Can you open an issue? This is a problem of SnowballAnalyzer missing the set ctor. Sure, I have done so - http://issues.apache.org/jira/browse/LUCENE-2165 Nick - To unsubscribe, e

RE: I need to implement a TokenFilter to break season07

2009-12-15 Thread Uwe Schindler
And if you do it yourself, don't forget to call clearAttributes() whenever you produce new tokens (else you may have bugs in the token increments). In the old token api its Token.clear()... Just a warning! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@

Re: I need to implement a TokenFilter to break season07

2009-12-15 Thread Koji Sekiguchi
Weiwei Wang wrote: Hi, all I currently need a TokenFilter to break token season07 into two tokens season 07 I'd recommend you refer to WordDelimiterFilter in Solr. Koji -- http://www.rondhuit.com/en/

Re: Search correction

2009-12-15 Thread Chris Were
I recall reading Google does it based on statistical analysis of what words users type. For example if I search for "googl" and then my next search is for "google" that is stored. Next time someone types "googl", "google" is suggested. Sorry, I don't have a source to link you to on that though. C

Re: Scoring formula - Average number of terms in IDF

2009-12-15 Thread kdev
Any ideas, please? -- View this message in context: http://old.nabble.com/Scoring-formula---Average-number-of-terms-in-IDF-tp26282578p26792364.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.

I need to implement a TokenFilter to break season07

2009-12-15 Thread Weiwei Wang
Hi, all I currently need a TokenFilter to break token season07 into two tokens season 07 I tried PatternReplaceCharFilter to replace "season07" with "season 07"; however, the offsets are not correct for highlighting. For this reason, I want to implement a TokenFilter, but I do not know how to
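The splitting logic itself, separate from the offset bookkeeping a real TokenFilter (such as the WordDelimiterFilter mentioned in the thread) must do, can be sketched with zero-width lookaround boundaries between letters and digits; the class and method names here are illustrative:

```java
import java.util.Arrays;

public class LetterDigitSplit {
    // Split at the zero-width boundary between a letter and a digit
    // (in either order); no characters are consumed by the split,
    // so "season07" yields "season" and "07".
    static String[] split(String token) {
        return token.split("(?<=\\p{L})(?=\\d)|(?<=\\d)(?=\\p{L})");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("season07")));  // [season, 07]
    }
}
```

Because the boundaries are zero-width, a TokenFilter built on this can compute each sub-token's start offset as the parent token's start plus the accumulated length of the preceding pieces, which keeps highlighting accurate.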