RE: add CJKTokenizer to solr

2007-06-22 Thread Xuesong Luo
Thanks, Toru and Chris, I tried both the CJKTokenizer and CJKAnalyzer. Both return some unexpected highlight results when I tested with German. The field value I searched is Ein Mann beißt den Hund. The search term is beißt. When using CJKAnalyzer, beißt is treated as 2 single terms (bei

Re: Conceptual Question

2007-06-22 Thread Frédéric Glorieux
Hi Yonik, Sorry to jump on an old post. "There is a change interface in JIRA, as long as all of the fields originally sent are stored." Do you remember the JIRA issue, or a token to find it? It sounds useful in some cases, for example, when you are working on analysers. That could be real

Re: add CJKTokenizer to solr

2007-06-22 Thread Daniel Alheiros
Hi Hoss. I've done a few tests using reflection to instantiate a simple object and the results will vary a lot depending on the JVM. As the JVM optimizes code as it is executed it will vary depending on the usage, but I think we have something to consider: If done 1,000 samples (5 clean X loop

Re: add CJKTokenizer to solr

2007-06-22 Thread Daniel Alheiros
Sorry I've confused things a bit... Thread safety has to be considered only on the Tokenizers, not on the factories. So are the Tokenizers thread safe? Regards, Daniel On 22/6/07 11:36, Daniel Alheiros [EMAIL PROTECTED] wrote: Hi Hoss. I've done a few tests using reflection to

Re: add CJKTokenizer to solr

2007-06-22 Thread Otis Gospodnetic
Tokenizers are not thread safe (I made a mistake yesterday saying they are - I don't know what I was thinking). This is why: public abstract class Tokenizer extends TokenStream { /** The text source for this Tokenizer. */ protected Reader input; oops :(

Re: add CJKTokenizer to solr

2007-06-22 Thread Chris Hostetter
: Sorry I've confused things a bit... Thread safety has to be : considered only on the Tokenizers, not on the factories. So are the : Tokenizers thread safe? nope ... they are constructed using Readers and maintain state about the text they are processing ... the only API is a next()
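The point made in the replies above (a tokenizer wraps a Reader and keeps state, so one instance cannot be shared between threads, while a stateless factory can) can be sketched roughly as follows. This is only an illustration: `SketchTokenizer` and `SketchTokenizerFactory` are hypothetical names, not Lucene or Solr classes, and the whitespace splitting merely stands in for real analysis.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tokenizer: it owns a Reader, so it is stateful.
class SketchTokenizer {
    private final Reader input; // per-instance state, not safe to share

    SketchTokenizer(Reader input) {
        this.input = input;
    }

    // Returns the next whitespace-delimited token, or null at end of input.
    String next() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (sb.length() > 0) {
                    break; // end of the current token
                }
                // otherwise skip leading whitespace
            } else {
                sb.append((char) c);
            }
        }
        return sb.length() > 0 ? sb.toString() : null;
    }
}

// Hypothetical factory: it holds no state, so sharing it across threads is fine.
class SketchTokenizerFactory {
    SketchTokenizer create(Reader input) {
        return new SketchTokenizer(input);
    }

    // Each call builds its own tokenizer, so concurrent callers never share one.
    static List<String> tokenize(SketchTokenizerFactory factory, String text) {
        SketchTokenizer tokenizer = factory.create(new StringReader(text));
        List<String> tokens = new ArrayList<>();
        try {
            for (String t = tokenizer.next(); t != null; t = tokenizer.next()) {
                tokens.add(t);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }
}
```

In this sketch the factory is the only object that would need to be thread safe; every tokenizer lives and dies within a single call.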

Re: add CJKTokenizer to solr

2007-06-22 Thread Mike Klaas
On 21-Jun-07, at 10:22 PM, Chris Hostetter wrote: like i said though: i'm in favor of factories like this ... i just don't think we should do anything to hide their use and make referring to Tokenizer or TokenFilter class names directly use reflection magically. What would be the best way

RE: Multi-language Tokenizers / Filters recommended?

2007-06-22 Thread Teruhiko Kurosaka
Hi Daniel, As you know, Chinese and Japanese do not use spaces or any other delimiters to break words. To overcome this problem, CJKTokenizer uses a method called bi-gram where a run of ideographic (=Chinese) characters is made into tokens of two neighboring characters. So a run of five
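The bi-gram idea described above can be sketched in a few lines of Java. This is not the CJKTokenizer source, only an illustration of overlapping two-character tokens. `BigramSketch` is a hypothetical name, the input is assumed to be a single run of ideographic characters already isolated from the surrounding text, and emitting a lone character as its own token is an assumption of this sketch. It also works on Java `char` units, so characters outside the Basic Multilingual Plane are not handled.

```java
import java.util.ArrayList;
import java.util.List;

class BigramSketch {
    // Splits one run of characters into overlapping two-character tokens.
    static List<String> bigrams(String run) {
        List<String> tokens = new ArrayList<>();
        if (run.length() == 1) {
            tokens.add(run); // assumption: a lone character becomes its own token
            return tokens;
        }
        for (int i = 0; i + 1 < run.length(); i++) {
            tokens.add(run.substring(i, i + 2)); // characters i and i+1
        }
        return tokens;
    }

    public static void main(String[] args) {
        // ASCII letters stand in for ideographs to keep the output readable.
        System.out.println(bigrams("ABCDE")); // prints [AB, BC, CD, DE]
    }
}
```

Because the tokens overlap, a query for any two adjacent characters matches without the tokenizer needing to know where words begin or end.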

Re: page rank

2007-06-22 Thread Nick Jenkin
Hi David 1) you will have to re-add the documents, Solr does not support an update operation (only add/del) 2) same as above, Solr does not support an update operation, you will need to re-add the document with the updated numberField, if it's any help I have a popularity field in my index (3

Re: Use Windows 1252 encoding...

2007-06-22 Thread Nick Jenkin
Have you tried using the PHP functions utf8_decode/utf8_encode? As far as I understand only UTF8 is supported (but I could be wrong on that!) -Nick On 6/23/07, escher2k [EMAIL PROTECTED] wrote: Is it possible to use Windows 1252 encoding instead of UTF-8 for Solr ? The application runs on
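For clients written in Java rather than PHP, the same re-encoding step could look roughly like the sketch below. It assumes the application holds Windows-1252 bytes and must hand UTF-8 to Solr; `Cp1252ToUtf8` is a hypothetical helper name. Note that PHP's utf8_encode converts from ISO-8859-1, which differs from Windows-1252 in the 0x80 to 0x9F range, so it is not an exact equivalent.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1252ToUtf8 {
    // Decodes Windows-1252 bytes to a String, then re-encodes it as UTF-8.
    static byte[] toUtf8(byte[] cp1252Bytes) {
        String text = new String(cp1252Bytes, Charset.forName("windows-1252"));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xDF is the Windows-1252 byte for the letter sharp s (U+00DF).
        byte[] utf8 = toUtf8(new byte[] {(byte) 0xDF});
        System.out.println(utf8.length); // prints 2 (bytes 0xC3 0x9F)
    }
}
```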

Re: Use Windows 1252 encoding...

2007-06-22 Thread Chris Hostetter
: Is it possible to use Windows 1252 encoding instead of UTF-8 for Solr ? The not at the moment... https://issues.apache.org/jira/browse/SOLR-96 -Hoss

Re: Conceptual Question

2007-06-22 Thread Chris Hostetter
: There is a change interface in JIRA, as long as all of the fields : originally sent are stored. : : Do you remember the JIRA issue, or a token to find it ? It sounds useful : in some cases, for example, when you are working on analysers. That : could be real life for me in future.