Re: Creating CJK bigram tokens with ClassicTokenizer

2018-10-03 Thread Yasufumi Mizoguchi
Hi, Shawn

Thank you for replying to me.

> CJKBigramFilter shouldn't care what tokenizer you're using.  It should
> work with any tokenizer.  What problem are you seeing that you're trying
> to solve?  What version of Solr, what configuration, and what does it do
> that you're not expecting, and what do you want it to do?

I am sorry for the lack of information. I tried this with Solr 5.5.5 and 7.5.0.
And here is the analyzer configuration from my managed-schema.


And what I want to do is:
1. to create CJK bigram tokens
2. to extract each word that joins stopwords with a hyphen as a single
token
   (e.g. as-is, to-be, etc.) from CJK and English sentences.
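Since the schema markup did not survive above, here is a hypothetical sketch of the kind of chain being described, ClassicTokenizer feeding CJKBigramFilter (the field type name and the width/lowercase filters are my assumptions, not the poster's actual config):

```xml
<fieldType name="text_cjk_classic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <!-- normalize full-width/half-width forms before bigramming -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emit bigrams for Han/Hiragana/Katakana/Hangul tokens -->
    <filter class="solr.CJKBigramFilterFactory"
            han="true" hiragana="true" katakana="true" hangul="true"/>
  </analyzer>
</fieldType>
```

With ClassicTokenizer upstream, this chain exhibits the reported problem: CJKBigramFilter passes the CJ tokens through unbigrammed because of the token-type mismatch explained in the rest of the message.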

CJKBigramFilter seems to check the token type (TOKEN_TYPES) attribute added
by StandardTokenizer when creating CJK bigram tokens.
(See
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/cjk/CJKBigramFilter.java#L64
)

ClassicTokenizer also adds token types, but it assigns the obsolete type
"<CJ>" to CJ tokens and "<ALPHANUM>" to the Korean alphabet (Hangul), and
neither type is a target for CJKBigramFilter...
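A standalone sketch of that type gate (the type-name strings are copied from the Lucene sources; this is a simplified illustration of the check, not the actual filter code):

```java
public class TypeGateDemo {
    // Type names StandardTokenizer emits and CJKBigramFilter matches on
    static final String[] BIGRAM_TYPES = {
        "<IDEOGRAPHIC>", "<HIRAGANA>", "<KATAKANA>", "<HANGUL>"
    };
    // Legacy type ClassicTokenizer assigns to CJK characters
    static final String CJ = "<CJ>";

    // Simplified stand-in for the per-token check CJKBigramFilter performs
    static boolean wouldBigram(String tokenType) {
        for (String t : BIGRAM_TYPES) {
            if (t.equals(tokenType)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(wouldBigram("<IDEOGRAPHIC>")); // StandardTokenizer CJK token: bigrammed
        System.out.println(wouldBigram(CJ));               // ClassicTokenizer CJK token: passed through
    }
}
```

One untested workaround would be a custom TokenFilter, inserted between ClassicTokenizer and CJKBigramFilter, that rewrites the TypeAttribute from "<CJ>" to "<IDEOGRAPHIC>" so the bigram filter picks those tokens up.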

Thanks,
Yasufumi

On Tue, Oct 2, 2018 at 0:05, Shawn Heisey wrote:

> On 9/30/2018 10:14 PM, Yasufumi Mizoguchi wrote:
> > I am looking for a way to create CJK bigram tokens with
> > ClassicTokenizer.
> > I tried this by using CJKBigramFilter, but it only supports
> > StandardTokenizer...
>
> CJKBigramFilter shouldn't care what tokenizer you're using.  It should
> work with any tokenizer.  What problem are you seeing that you're trying
> to solve?  What version of Solr, what configuration, and what does it do
> that you're not expecting, and what do you want it to do?
>
> I don't have access to the systems where I was using that filter, but if
> I recall correctly, I was using the whitespace tokenizer.
>
> Thanks,
> Shawn
>
>


Re: Creating CJK bigram tokens with ClassicTokenizer

2018-10-01 Thread Shawn Heisey

On 9/30/2018 10:14 PM, Yasufumi Mizoguchi wrote:

I am looking for a way to create CJK bigram tokens with ClassicTokenizer.
I tried this by using CJKBigramFilter, but it only supports
StandardTokenizer...


CJKBigramFilter shouldn't care what tokenizer you're using.  It should 
work with any tokenizer.  What problem are you seeing that you're trying 
to solve?  What version of Solr, what configuration, and what does it do 
that you're not expecting, and what do you want it to do?


I don't have access to the systems where I was using that filter, but if 
I recall correctly, I was using the whitespace tokenizer.


Thanks,
Shawn



Creating CJK bigram tokens with ClassicTokenizer

2018-09-30 Thread Yasufumi Mizoguchi
Hi,

I am looking for a way to create CJK bigram tokens with ClassicTokenizer.
I tried this by using CJKBigramFilter, but it only supports
StandardTokenizer...

So, is there any good way to do that?

Thanks,
Yasufumi


Re: ClassicTokenizer

2018-01-11 Thread Steve Rowe
Hi Rick,

Quoting Robert Muir’s comments on 
https://issues.apache.org/jira/browse/LUCENE-2167 (he’s referring to the word 
break rules in UAX#29[1] when he says “the standard”):
 
> i actually am of the opinion StandardTokenizer should follow unicode standard 
> tokenization. then we can throw subjective decisions away, and stick with a 
> standard.

> I think it would be really nice for StandardTokenizer to adhere straight to 
> the standard as much as we can with jflex [] Then its name would actually 
> make sense.


[1] Unicode Standard Annex #29: Unicode Text Segmentation 
<http://unicode.org/reports/tr29/>

--
Steve
www.lucidworks.com

> On Jan 10, 2018, at 10:09 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> 
> On 1/10/2018 2:27 PM, Rick Leir wrote:
>> I did not express that clearly.
>> The reference guide says "The Classic Tokenizer preserves the same behavior 
>> as the Standard Tokenizer of Solr versions 3.1 and previous. "
>> So I am curious to know why they changed StandardTokenizer after 3.1 to 
>> break on hyphens, when it seems to me that it worked better the old way?
> 
> I really have no idea.  Those are Lucene classes, not Solr.  Maybe someone 
> who was around for whatever discussions happened on Lucene lists back in 
> those days will comment.
> 
> I wasn't able to find the issue where ClassicTokenizer was created, and I 
> couldn't find any information discussing the change.
> 
> If I had to guess why StandardTokenizer was updated this way, I think it is 
> to accommodate searches where people were searching for one word in text 
> where that word was part of something larger with a hyphen, and it wasn't 
> being found.  There was probably a discussion among the developers about what 
> a typical Lucene user would want, so they could decide what they would have 
> the standard tokenizer do.
> 
> Likely because there was a vocal segment of the community reliant on the old 
> behavior, they preserved that behavior in ClassicTokenizer, but updated the 
> standard one to do what they felt would be normal for a typical user.
> 
> Obviously *your* needs do not fall in line with what was decided ... so the 
> standard tokenizer isn't going to work for you.
> 
> Thanks,
> Shawn



Re: ClassicTokenizer

2018-01-10 Thread Shawn Heisey

On 1/10/2018 2:27 PM, Rick Leir wrote:

I did not express that clearly.
The reference guide says "The Classic Tokenizer preserves the same behavior as the 
Standard Tokenizer of Solr versions 3.1 and previous. "

So I am curious to know why they changed StandardTokenizer after 3.1 to break 
on hyphens, when it seems to me that it worked better the old way?


I really have no idea.  Those are Lucene classes, not Solr.  Maybe 
someone who was around for whatever discussions happened on Lucene lists 
back in those days will comment.


I wasn't able to find the issue where ClassicTokenizer was created, and 
I couldn't find any information discussing the change.


If I had to guess why StandardTokenizer was updated this way, I think it 
is to accommodate searches where people were searching for one word in 
text where that word was part of something larger with a hyphen, and it 
wasn't being found.  There was probably a discussion among the 
developers about what a typical Lucene user would want, so they could 
decide what they would have the standard tokenizer do.


Likely because there was a vocal segment of the community reliant on the 
old behavior, they preserved that behavior in ClassicTokenizer, but 
updated the standard one to do what they felt would be normal for a 
typical user.


Obviously *your* needs do not fall in line with what was decided ... so 
the standard tokenizer isn't going to work for you.


Thanks,
Shawn


Re: ClassicTokenizer

2018-01-10 Thread Rick Leir
Shawn
I did not express that clearly. 
The reference guide says "The Classic Tokenizer preserves the same behavior as 
the Standard Tokenizer of Solr versions 3.1 and previous. "

So I am curious to know why they changed StandardTokenizer after 3.1 to break 
on hyphens, when it seems to me that it worked better the old way?
Thanks
Rick

On January 9, 2018 7:07:59 PM EST, Shawn Heisey <apa...@elyograg.org> wrote:
>On 1/9/2018 9:36 AM, Rick Leir wrote:
>> A while ago the default was changed to StandardTokenizer from
>ClassicTokenizer. The biggest difference seems to be that Classic does
>not break on hyphens. There is also a different character pr(mumble). I
>prefer the Classic's non-break on hyphens.
>
>To have any ability to research changes, we're going to need to know
>precisely what you mean by "default" in that statement.
>
>Are you talking about the example schemas, or some kind of inherent
>default when an analysis chain is not specified?
>
>Probably the reason for the change is an attempt to move into the
>modern
>era, become more standardized, and stop using old/legacy
>implementations.  The name of the new default contains the word
>"Standard" which would fit in with that goal.
>
>I can't locate any changes in the last couple of years that change the
>classic tokenizer to standard.  Maybe I just don't know the right place
>to look.
>
>> What was the reason for changing this default? If I understand this
>better I can avoid some pitfalls, perhaps.
>
>If you are talking about example schemas, then the following may apply:
>
>Because you understand how analysis components work well enough to even
>ask your question, I think you're probably the kind of admin who is
>going to thoroughly customize the schema and not rely on the defaults
>for TextField types that come with Solr.  You're free to continue using
>the classic tokenizer in your schema if that meets your needs better
>than whatever changes are made to the examples by the devs.  The
>examples are only starting points, virtually all Solr installs require
>customizing the schema.
>
>Thanks,
>Shawn

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: ClassicTokenizer

2018-01-09 Thread Shawn Heisey
On 1/9/2018 9:36 AM, Rick Leir wrote:
> A while ago the default was changed to StandardTokenizer from 
> ClassicTokenizer. The biggest difference seems to be that Classic does not 
> break on hyphens. There is also a different character pr(mumble). I prefer 
> the Classic's non-break on hyphens.

To have any ability to research changes, we're going to need to know
precisely what you mean by "default" in that statement.

Are you talking about the example schemas, or some kind of inherent
default when an analysis chain is not specified?

Probably the reason for the change is an attempt to move into the modern
era, become more standardized, and stop using old/legacy
implementations.  The name of the new default contains the word
"Standard" which would fit in with that goal.

I can't locate any changes in the last couple of years that change the
classic tokenizer to standard.  Maybe I just don't know the right place
to look.

> What was the reason for changing this default? If I understand this better I 
> can avoid some pitfalls, perhaps.

If you are talking about example schemas, then the following may apply:

Because you understand how analysis components work well enough to even
ask your question, I think you're probably the kind of admin who is
going to thoroughly customize the schema and not rely on the defaults
for TextField types that come with Solr.  You're free to continue using
the classic tokenizer in your schema if that meets your needs better
than whatever changes are made to the examples by the devs.  The
examples are only starting points, virtually all Solr installs require
customizing the schema.

Thanks,
Shawn
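For anyone who wants to stay on the classic behavior Shawn describes, the opt-in is just the tokenizer choice in the field type. A minimal sketch, with a hypothetical type name:

```xml
<fieldType name="text_classic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- pre-3.1 StandardTokenizer behavior, including the old hyphen handling -->
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```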



ClassicTokenizer

2018-01-09 Thread Rick Leir
Hi all
A while ago the default was changed to StandardTokenizer from ClassicTokenizer. 
The biggest difference seems to be that Classic does not break on hyphens. 
There is also a different character pr(mumble). I prefer the Classic's 
non-break on hyphens. 

What was the reason for changing this default? If I understand this better I 
can avoid some pitfalls, perhaps.
Thanks -- Rick
-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

What we lose if we use ClassicTokenizer instead of StandardTokenizer

2012-06-19 Thread Alok Bhandari
Hello,

I need to know what things I will lose if I use ClassicTokenizer instead of
StandardTokenizer. Will ClassicTokenizer be deprecated in a future Solr
version, or will its development halt? Please let me know.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/What-we-loose-if-we-use-ClassicTokenizer-instead-of-StandardTokenizer-tp3990249.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: What we lose if we use ClassicTokenizer instead of StandardTokenizer

2012-06-19 Thread Erick Erickson
You're asking us to predict the future, and if I could do that, I'd be rich
enough to build a mansion. It's not marked as deprecated in 4.x or trunk, so
it doesn't look like there are any plans to deprecate it. Although what the
future holds is a good question...

I'd _strongly_ advise that you look at the admin/analysis page with the two
tokenizers to get a feel for how they behave. I find a few minutes playing
around there gives me a better sense of what's going on than descriptions...

Best
Erick

On Tue, Jun 19, 2012 at 5:08 AM, Alok Bhandari
alokomprakashbhand...@gmail.com wrote:
> Hello,
>
> I need to know what things I will lose if I use ClassicTokenizer instead of
> StandardTokenizer. Will ClassicTokenizer be deprecated in a future Solr
> version, or will its development halt? Please let me know.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/What-we-loose-if-we-use-ClassicTokenizer-instead-of-StandardTokenizer-tp3990249.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: What we lose if we use ClassicTokenizer instead of StandardTokenizer

2012-06-19 Thread Alok Bhandari

Thanks for the reply. Yes, I had started using the admin/analysis page before
you suggested it, but I just wanted to know whether anything specific is
supported or not supported out of the box by the tokenizers I mentioned.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/What-we-loose-if-we-use-ClassicTokenizer-instead-of-StandardTokenizer-tp3990249p3990278.html
Sent from the Solr - User mailing list archive at Nabble.com.