Re: Creating CJK bigram tokens with ClassicTokenizer
Hi Shawn,

Thank you for replying to me.

> CJKBigramFilter shouldn't care what tokenizer you're using. It should
> work with any tokenizer. What problem are you seeing that you're trying
> to solve? What version of Solr, what configuration, and what does it do
> that you're not expecting, and what do you want it to do?

I am sorry for the lack of information. I tried this with Solr 5.5.5 and 7.5.0, and here is the analyzer configuration from my managed-schema. What I want to do is:

1. create CJK bigram tokens, and
2. extract each word that contains a hyphen, plus stopwords, as a single token (e.g. as-is, to-be, etc.) from CJK and English sentences.

CJKBigramFilter seems to check the TOKEN_TYPES attribute added by StandardTokenizer when creating CJK bigram tokens. (See https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/cjk/CJKBigramFilter.java#L64 )

ClassicTokenizer also adds the obsolete TOKEN_TYPE "CJ" to CJ tokens and "ALPHANUM" to Korean characters, but neither is a target for CJKBigramFilter...

Thanks,
Yasufumi

On Tue, Oct 2, 2018 at 0:05, Shawn Heisey wrote:
> On 9/30/2018 10:14 PM, Yasufumi Mizoguchi wrote:
> > I am looking for the way to create CJK bigram tokens with
> > ClassicTokenizer. I tried this by using CJKBigramFilter, but it only
> > supports StandardTokenizer...
>
> CJKBigramFilter shouldn't care what tokenizer you're using. It should
> work with any tokenizer. What problem are you seeing that you're trying
> to solve? What version of Solr, what configuration, and what does it do
> that you're not expecting, and what do you want it to do?
>
> I don't have access to the systems where I was using that filter, but if
> I recall correctly, I was using the whitespace tokenizer.
>
> Thanks,
> Shawn
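The type-gating behavior Yasufumi describes can be illustrated with a short conceptual sketch. This is NOT Lucene's code, just a toy model: tokens carry a type string alongside their text, and bigramming only fires when the type is one of the values StandardTokenizer emits for CJK characters. ClassicTokenizer's legacy "<CJ>" type is not in that set, so its tokens pass through un-bigrammed.

```python
# Conceptual sketch (not Lucene's implementation). The accepted type strings
# mirror StandardTokenizer's TOKEN_TYPES; ClassicTokenizer tags the same
# characters "<CJ>", which the filter does not recognize.
ACCEPTED = {"<IDEOGRAPHIC>", "<HIRAGANA>", "<KATAKANA>", "<HANGUL>"}

def cjk_bigrams(tokens):
    """tokens: list of (text, type) pairs, one CJK character per token."""
    out = []
    for i, (text, ttype) in enumerate(tokens):
        if ttype not in ACCEPTED:
            out.append((text, ttype))  # non-CJK token: pass through unchanged
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        prv = tokens[i - 1] if i > 0 else None
        if nxt and nxt[1] in ACCEPTED:
            out.append((text + nxt[0], "<DOUBLE>"))  # overlapping bigram
        elif not (prv and prv[1] in ACCEPTED):
            out.append((text, "<SINGLE>"))           # isolated CJK character
        # a trailing char already covered by the previous bigram emits nothing
    return out

# StandardTokenizer-style types are joined into overlapping bigrams...
print(cjk_bigrams([("日", "<IDEOGRAPHIC>"), ("本", "<IDEOGRAPHIC>"), ("語", "<IDEOGRAPHIC>")]))
# -> [('日本', '<DOUBLE>'), ('本語', '<DOUBLE>')]

# ...while ClassicTokenizer-style "<CJ>" types pass through untouched.
print(cjk_bigrams([("日", "<CJ>"), ("本", "<CJ>"), ("語", "<CJ>")]))
# -> [('日', '<CJ>'), ('本', '<CJ>'), ('語', '<CJ>')]
```

The real filter also has per-script flags (han, hiragana, katakana, hangul) and an option to emit unigrams alongside bigrams; the sketch above only shows the type check that makes ClassicTokenizer output invisible to it.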
Re: Creating CJK bigram tokens with ClassicTokenizer
On 9/30/2018 10:14 PM, Yasufumi Mizoguchi wrote:
> I am looking for the way to create CJK bigram tokens with
> ClassicTokenizer. I tried this by using CJKBigramFilter, but it only
> supports StandardTokenizer...

CJKBigramFilter shouldn't care what tokenizer you're using. It should work with any tokenizer. What problem are you seeing that you're trying to solve? What version of Solr, what configuration, and what does it do that you're not expecting, and what do you want it to do?

I don't have access to the systems where I was using that filter, but if I recall correctly, I was using the whitespace tokenizer.

Thanks,
Shawn
Creating CJK bigram tokens with ClassicTokenizer
Hi,

I am looking for a way to create CJK bigram tokens with ClassicTokenizer. I tried this using CJKBigramFilter, but it only supports StandardTokenizer... So, is there any good way to do that?

Thanks,
Yasufumi
Re: ClassicTokenizer
Hi Rick,

Quoting Robert Muir’s comments on https://issues.apache.org/jira/browse/LUCENE-2167 (he’s referring to the word break rules in UAX#29 [1] when he says “the standard”):

> i actually am of the opinion StandardTokenizer should follow unicode standard
> tokenization. then we can throw subjective decisions away, and stick with a
> standard.

> I think it would be really nice for StandardTokenizer to adhere straight to
> the standard as much as we can with jflex [...] Then its name would actually
> make sense.

[1] Unicode Standard Annex #29: Unicode Text Segmentation <http://unicode.org/reports/tr29/>

--
Steve
www.lucidworks.com

> On Jan 10, 2018, at 10:09 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>
> On 1/10/2018 2:27 PM, Rick Leir wrote:
>> I did not express that clearly. The reference guide says "The Classic
>> Tokenizer preserves the same behavior as the Standard Tokenizer of Solr
>> versions 3.1 and previous." So I am curious to know why they changed
>> StandardTokenizer after 3.1 to break on hyphens, when it seems to me to
>> work better the old way?
>
> I really have no idea. Those are Lucene classes, not Solr. Maybe someone
> who was around for whatever discussions happened on Lucene lists back in
> those days will comment.
>
> I wasn't able to find the issue where ClassicTokenizer was created, and I
> couldn't find any information discussing the change.
>
> If I had to guess why StandardTokenizer was updated this way, I think it is
> to accommodate searches where people were searching for one word in text
> where that word was part of something larger with a hyphen, and it wasn't
> being found. There was probably a discussion among the developers about what
> a typical Lucene user would want, so they could decide what they would have
> the standard tokenizer do.
>
> Likely because there was a vocal segment of the community reliant on the old
> behavior, they preserved that behavior in ClassicTokenizer, but updated the
> standard one to do what they felt would be normal for a typical user.
>
> Obviously *your* needs do not fall in line with what was decided ... so the
> standard tokenizer isn't going to work for you.
>
> Thanks,
> Shawn
Re: ClassicTokenizer
On 1/10/2018 2:27 PM, Rick Leir wrote:
> I did not express that clearly. The reference guide says "The Classic
> Tokenizer preserves the same behavior as the Standard Tokenizer of Solr
> versions 3.1 and previous." So I am curious to know why they changed
> StandardTokenizer after 3.1 to break on hyphens, when it seems to me to
> work better the old way?

I really have no idea. Those are Lucene classes, not Solr. Maybe someone who was around for whatever discussions happened on Lucene lists back in those days will comment.

I wasn't able to find the issue where ClassicTokenizer was created, and I couldn't find any information discussing the change.

If I had to guess why StandardTokenizer was updated this way, I think it is to accommodate searches where people were searching for one word in text where that word was part of something larger with a hyphen, and it wasn't being found. There was probably a discussion among the developers about what a typical Lucene user would want, so they could decide what they would have the standard tokenizer do.

Likely because there was a vocal segment of the community reliant on the old behavior, they preserved that behavior in ClassicTokenizer, but updated the standard one to do what they felt would be normal for a typical user.

Obviously *your* needs do not fall in line with what was decided ... so the standard tokenizer isn't going to work for you.

Thanks,
Shawn
Re: ClassicTokenizer
Shawn,

I did not express that clearly. The reference guide says "The Classic Tokenizer preserves the same behavior as the Standard Tokenizer of Solr versions 3.1 and previous." So I am curious to know why they changed StandardTokenizer after 3.1 to break on hyphens, when it seems to me to work better the old way?

Thanks,
Rick

On January 9, 2018 7:07:59 PM EST, Shawn Heisey <apa...@elyograg.org> wrote:
> On 1/9/2018 9:36 AM, Rick Leir wrote:
>> A while ago the default was changed to StandardTokenizer from
>> ClassicTokenizer. The biggest difference seems to be that Classic does
>> not break on hyphens. There is also a different character pr(mumble). I
>> prefer the Classic's non-break on hyphens.
>
> To have any ability to research changes, we're going to need to know
> precisely what you mean by "default" in that statement.
>
> Are you talking about the example schemas, or some kind of inherent
> default when an analysis chain is not specified?
>
> Probably the reason for the change is an attempt to move into the modern
> era, become more standardized, and stop using old/legacy
> implementations. The name of the new default contains the word
> "Standard", which would fit in with that goal.
>
> I can't locate any changes in the last couple of years that change the
> classic tokenizer to standard. Maybe I just don't know the right place
> to look.
>
>> What was the reason for changing this default? If I understand this
>> better I can avoid some pitfalls, perhaps.
>
> If you are talking about example schemas, then the following may apply:
>
> Because you understand how analysis components work well enough to even
> ask your question, I think you're probably the kind of admin who is
> going to thoroughly customize the schema and not rely on the defaults
> for TextField types that come with Solr. You're free to continue using
> the classic tokenizer in your schema if that meets your needs better
> than whatever changes are made to the examples by the devs. The examples
> are only starting points; virtually all Solr installs require
> customizing the schema.
>
> Thanks,
> Shawn

--
Sorry for being brief. Alternate email is rickleir at yahoo dot com
Re: ClassicTokenizer
On 1/9/2018 9:36 AM, Rick Leir wrote:
> A while ago the default was changed to StandardTokenizer from
> ClassicTokenizer. The biggest difference seems to be that Classic does
> not break on hyphens. There is also a different character pr(mumble). I
> prefer the Classic's non-break on hyphens.

To have any ability to research changes, we're going to need to know precisely what you mean by "default" in that statement.

Are you talking about the example schemas, or some kind of inherent default when an analysis chain is not specified?

Probably the reason for the change is an attempt to move into the modern era, become more standardized, and stop using old/legacy implementations. The name of the new default contains the word "Standard", which would fit in with that goal.

I can't locate any changes in the last couple of years that change the classic tokenizer to standard. Maybe I just don't know the right place to look.

> What was the reason for changing this default? If I understand this
> better I can avoid some pitfalls, perhaps.

If you are talking about example schemas, then the following may apply:

Because you understand how analysis components work well enough to even ask your question, I think you're probably the kind of admin who is going to thoroughly customize the schema and not rely on the defaults for TextField types that come with Solr. You're free to continue using the classic tokenizer in your schema if that meets your needs better than whatever changes are made to the examples by the devs. The examples are only starting points; virtually all Solr installs require customizing the schema.

Thanks,
Shawn
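The hyphen difference the thread keeps coming back to can be shown with a toy sketch. These two regex-based functions are NOT the real JFlex grammars, only an illustration of the behavior as described above: the "classic-like" function keeps a hyphenated word as one token, while the "standard-like" one (closer to UAX#29 word breaks) splits at the hyphen. The real ClassicTokenizer has further rules (e.g. special handling for tokens containing digits, email addresses, and hostnames) that this sketch ignores.

```python
import re

def classic_like(text):
    # a word plus any number of "-word" continuations stays whole
    return re.findall(r"\w+(?:-\w+)*", text)

def standard_like(text):
    # the hyphen acts as a break: each run of word characters is its own token
    return re.findall(r"\w+", text)

print(classic_like("an as-is wi-fi network"))
# -> ['an', 'as-is', 'wi-fi', 'network']
print(standard_like("an as-is wi-fi network"))
# -> ['an', 'as', 'is', 'wi', 'fi', 'network']
```

This is also why a search for "fi" can match a document containing "wi-fi" under the standard behavior but not under the classic one: the index holds different tokens.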
ClassicTokenizer
Hi all,

A while ago the default was changed to StandardTokenizer from ClassicTokenizer. The biggest difference seems to be that Classic does not break on hyphens. There is also a different character pr(mumble). I prefer the Classic's non-break on hyphens.

What was the reason for changing this default? If I understand this better I can avoid some pitfalls, perhaps.

Thanks,
Rick
What we lose if we use ClassicTokenizer instead of StandardTokenizer
Hello,

I need to know what I will lose if I use ClassicTokenizer instead of StandardTokenizer. Is it the case that in future Solr versions ClassicTokenizer will be deprecated, or that development on ClassicTokenizer will halt? Please let me know.

--
View this message in context: http://lucene.472066.n3.nabble.com/What-we-loose-if-we-use-ClassicTokenizer-instead-of-StandardTokenizer-tp3990249.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: What we lose if we use ClassicTokenizer instead of StandardTokenizer
You're asking us to predict the future, which, if I could, would make me rich enough to build a mansion.

It's not marked as deprecated in 4.x or trunk, so it doesn't look like there are any plans to deprecate it. Although what the future holds is a good question...

I'd _strongly_ advise that you look at the admin/analysis page with the two tokenizers to get a feel for how they behave. I find a few minutes playing around there gives me a better sense of what's going on than descriptions do...

Best,
Erick

On Tue, Jun 19, 2012 at 5:08 AM, Alok Bhandari <alokomprakashbhand...@gmail.com> wrote:
> Hello,
>
> I need to know what I will lose if I use ClassicTokenizer instead of
> StandardTokenizer. Is it the case that in future Solr versions
> ClassicTokenizer will be deprecated, or that development on
> ClassicTokenizer will halt? Please let me know.
Re: What we lose if we use ClassicTokenizer instead of StandardTokenizer
Thanks for the reply. Yes, I had started with the admin/analysis page before you suggested it, but I just wanted to know whether, out of the box, anything specific is supported or not supported by these tokenizers.