Re: FW: Difference Between Tokenizer and filter

2016-03-03 Thread Jack Krupansky

Try re-reading the doc on "Understanding Analyzers, Tokenizers, and
Filters" and then ask specific questions on specific statements made in the
doc:
https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers,+Tokenizers,+and+Filters

As for the on-disk format, a Solr user has absolutely zero reason to be
concerned about what format Lucene uses to store the index on disk. You are
certainly welcome to dive down to that level if you wish, but that is not
something worth discussing on this list. To a Solr user the index is simply
a list of terms at positions, both determined by the character filters,
tokenizer, and token filters of the analyzer. The format of that
information as stored in Lucene won't impact the behavior of your Solr app
in any way.

Again, to be clear, you need to be thoroughly familiar with that doc
section. It won't help you to try to guess questions to ask if you don't
have a basic understanding of what is stated on that doc page.

It might also help to visualize what the doc says by using the Analysis
page of the Solr admin UI, which shows all the intermediate and final
results of the analysis process: the specific token/term text and
position at each step. But even that won't help if you are unable to grasp
what is stated on the basic doc page.

-- Jack Krupansky

On Thu, Mar 3, 2016 at 8:51 AM, G, Rajesh <r...@cebglobal.com> wrote:

> Hi Shawn,
>
> One last question on analyzers. If the format of the index on disk is not
> controlled by the tokenizer, or anything else in the analysis chain, then
> what do type="index" and type="query" on an analyzer mean? Can you please
> help me understand?
>
> Corporate Executive Board India Private Limited. Registration No:
> U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF Building
> No.10 DLF Cyber City, Gurgaon, Haryana-122002, India.
>
> This e-mail and/or its attachments are intended only for the use of the
> addressee(s) and may contain confidential and legally privileged
> information belonging to CEB and/or its subsidiaries, including CEB
> subsidiaries that offer SHL Talent Measurement products and services. If
> you have received this e-mail in error, please notify the sender and
> immediately, destroy all copies of this email and its attachments. The
> publication, copying, in whole or in part, or use or dissemination in any
> other way of this e-mail and attachments by anyone other than the intended
> person(s) is prohibited.
>
> -Original Message-
> From: G, Rajesh
> Sent: Thursday, March 3, 2016 6:12 PM
> To: 'solr-user@lucene.apache.org' <solr-user@lucene.apache.org>
> Subject: RE: FW: Difference Between Tokenizer and filter
>
> Thanks Shawn. This helps
>
> -Original Message-
> From: Shawn Heisey [mailto:apa...@elyograg.org]
> Sent: Wednesday, March 2, 2016 11:04 PM
> To: solr-user@lucene.apache.org
> Subject: Re: FW: Difference Between Tokenizer and filter
>
> On 3/2/2016 9:55 AM, G, Rajesh wrote:
> > Thanks for your email Koji. Can you please explain what is the role of
> tokenizer and filter so I can understand why I should not have two
> tokenizer in index and I should have at least one tokenizer in query?
>
> You can't have two tokenizers.  It's not allowed.
>
> The only notable difference between a Tokenizer and a Filter is that a
> Tokenizer operates on an input that's a single string, turning it into a
> token stream, and a Filter uses a token stream for both input and output.
> A CharFilter uses a single string as both input and output.
>
> An analysis chain in the Solr schema (whether it's index or query) is
> composed of zero or more CharFilter entries, exactly one Tokenizer entry,
> and zero or more Filter entries.  Alternately, you can specify an Analyzer
> class, which is a lot like a Tokenizer.  An Analyzer is effectively the
> same thing as a tokenizer combined with filters.
>
> CharFilters run before the Tokenizer, and Filters run after the
> Tokenizer.  CharFilters, Tokenizers, Filters, and Analyzers are Lucene
> concepts.
>
> > My understanding is tokenizer is used to say how the content should be
> > indexed physically in file system. Filters are used to query result
>
> The format of the index on disk is not controlled by the tokenizer, or
> anything else in the analysis chain.  It is controlled by the Lucene
> codec.  Only a very small part of the codec is configurable in Solr, but
> normally this does not need configuring.  The codec defaults are
> appropriate for the majority of use cases.
>
> Thanks,
> Shawn
>
>

RE: FW: Difference Between Tokenizer and filter

2016-03-03 Thread Vanlerberghe, Luc

The "index" type analyzer is used when documents are indexed and determines 
what tokens end up in the index.
The "query" type analyzer is used to analyze the user query and determines what 
tokens will be searched for.

As an example: if you want to be able to match on synonyms, you could have a 
"query" type analyzer that replaces each token in the user's query with the 
list of corresponding synonyms. The "index" type analyzer would then just index 
the tokens as they are.

(If you have a fixed list of synonyms, both could map each token to a 
pre-defined 'canonical' synonym and save both index and query time)
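
A rough sketch of the first variant, with synonyms expanded only at query time. The field type name `text_syn` and the file name `synonyms.txt` are placeholders chosen for illustration, not taken from anyone's actual schema. `solr.SynonymFilterFactory` is the factory available in Solr 5.x; later releases favour `solr.SynonymGraphFilterFactory`.

```xml
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <!-- Used when documents are indexed: tokens go into the index as they are -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Used when the user query is analyzed: each token is expanded to its synonyms -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```

Both chains share the same tokenizer and lowercase filter, so the tokens produced from the query can line up with the tokens stored in the index.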

Luc

-Original Message-
From: G, Rajesh [mailto:r...@cebglobal.com] 
Sent: Thursday, 3 March 2016 14:51
To: solr-user@lucene.apache.org
Subject: RE: FW: Difference Between Tokenizer and filter

Hi Shawn,

One last question on analyzers. If the format of the index on disk is not 
controlled by the tokenizer, or anything else in the analysis chain, then what 
do type="index" and type="query" on an analyzer mean? Can you please help me 
understand?

RE: FW: Difference Between Tokenizer and filter

2016-03-03 Thread G, Rajesh

Hi Shawn,

One last question on analyzers. If the format of the index on disk is not 
controlled by the tokenizer, or anything else in the analysis chain, then what 
do type="index" and type="query" on an analyzer mean? Can you please help me 
understand?

RE: FW: Difference Between Tokenizer and filter

2016-03-03 Thread G, Rajesh

Thanks Shawn. This helps

Re: FW: Difference Between Tokenizer and filter

2016-03-02 Thread Shawn Heisey

On 3/2/2016 9:55 AM, G, Rajesh wrote:
> Thanks for your email Koji. Can you please explain what is the role of 
> tokenizer and filter so I can understand why I should not have two tokenizer 
> in index and I should have at least one tokenizer in query?

You can't have two tokenizers.  It's not allowed.

The only notable difference between a Tokenizer and a Filter is that a
Tokenizer operates on an input that's a single string, turning it into a
token stream, and a Filter uses a token stream for both input and
output.  A CharFilter uses a single string as both input and output.

An analysis chain in the Solr schema (whether it's index or query) is
composed of zero or more CharFilter entries, exactly one Tokenizer
entry, and zero or more Filter entries.  Alternatively, you can specify an
Analyzer class, which takes the place of the whole chain.  An Analyzer is
effectively the same thing as a tokenizer combined with filters.
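
As a rough sketch of those two shapes: the field type names below are made up for illustration, and the particular factories are just common picks, not anything from the poster's schema.

```xml
<!-- Shape 1: zero or more charFilters, exactly one tokenizer, zero or more filters -->
<fieldType name="text_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- string in, string out -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- string in, token stream out -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- token stream in, token stream out -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Shape 2: a ready-made Lucene Analyzer class instead of a hand-built chain -->
<fieldType name="text_ws_example" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
</fieldType>
```

An analyzer with no type attribute, as in both cases here, is used for indexing and for querying alike.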

CharFilters run before the Tokenizer, and Filters run after the
Tokenizer.  CharFilters, Tokenizers, Filters, and Analyzers are Lucene
concepts.

> My understanding is tokenizer is used to say how the content should be 
> indexed physically in file system. Filters are used to query result

The format of the index on disk is not controlled by the tokenizer, or
anything else in the analysis chain.  It is controlled by the Lucene
codec.  Only a very small part of the codec is configurable in Solr, but
normally this does not need configuring.  The codec defaults are
appropriate for the majority of use cases.
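
For reference, the configurable part lives in solrconfig.xml rather than in the analysis chain. A minimal sketch, assuming the stock `solr.SchemaCodecFactory`, which allows a field type in the schema to carry attributes such as `postingsFormat` or `docValuesFormat`:

```xml
<!-- solrconfig.xml: the codec is chosen here, not by the tokenizer or filters -->
<codecFactory class="solr.SchemaCodecFactory"/>
```

Nothing in this fragment changes which tokens are produced; it only concerns how Lucene writes them to disk.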

Thanks,
Shawn

RE: FW: Difference Between Tokenizer and filter

2016-03-02 Thread G, Rajesh

Thanks for your email, Koji. Can you please explain the roles of the 
tokenizer and the filter, so I can understand why I should not have two 
tokenizers in index and why I should have at least one tokenizer in query?

My understanding is that the tokenizer is used to say how the content should be 
indexed physically in the file system, and filters are used to query results.

-Original Message-
From: Koji Sekiguchi [mailto:koji.sekigu...@rondhuit.com]
Sent: Wednesday, March 2, 2016 8:10 PM
To: solr-user@lucene.apache.org
Subject: Re: FW: Difference Between Tokenizer and filter

Hi,

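<analyzer>...</analyzer> must have one and only one <tokenizer/> and it can 
have zero or more <filter/>s. From the point of view of the rules, your 
<analyzer type="index">...</analyzer> is not correct because it has more than 
one <tokenizer/> and <analyzer type="query">...</analyzer> is not correct as 
well because it has no <tokenizer/>.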

Koji

On 2016/03/02 20:25, G, Rajesh wrote:
> Hi Team,
>
> Can you please clarify the below? My understanding is that the tokenizer is 
> used to say how the content should be indexed physically in the file system, 
> and filters are used to query results. The lines below are from my setup. But 
> I have seen examples that include filters inside the index analyzer and a 
> tokenizer in the query analyzer, which confused me.
>
> <fieldType ... positionIncrementGap="100" >
>   <analyzer type="index">
>     <tokenizer class="solr.LowerCaseTokenizerFactory"/>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
>   </analyzer>
>   <analyzer type="query">
>     <filter ... minGramSize="2" maxGramSize="2"/>
>   </analyzer>
> </fieldType>
>
> My goal is to use Solr to find the best match among the technology
> names, e.g. actual tech names:
>
> 1.   Microsoft Visual Studio
>
> 2.   Microsoft Internet Explorer
>
> 3.   Microsoft Visio
>
> When the user types "Microsoft Visal Studio", the user should get Microsoft
> Visual Studio. Basically, misspelled and jumbled words should match the
> closest tech name.
>

Re: FW: Difference Between Tokenizer and filter

2016-03-02 Thread Koji Sekiguchi

Hi,

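<analyzer>...</analyzer> must have one and only one <tokenizer/> and
it can have zero or more <filter/>s. From the point of view of the
rules, your <analyzer type="index">...</analyzer> is not correct
because it has more than one <tokenizer/> and
<analyzer type="query">...</analyzer> is not correct as well because it
has no <tokenizer/>.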
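
A chain that satisfies this rule might look like the sketch below. It is only one possible arrangement: the field type name `text_ngram` is made up, and choosing StandardTokenizer followed by `solr.NGramFilterFactory` (with the 2-gram sizes from the original post) is an assumption about what was intended, not a tested recommendation for the fuzzy-matching goal.

```xml
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- exactly one tokenizer -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- zero or more filters, applied in order -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
  </analyzer>
  <analyzer type="query">
    <!-- the query side needs its own tokenizer as well -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
  </analyzer>
</fieldType>
```

With identical chains on both sides, a misspelling such as "Visal" still shares most of its 2-grams ("vi", "is", "al") with "Visual", which is the kind of partial overlap the original question is after.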

Koji

FW: Difference Between Tokenizer and filter

2016-03-02 Thread G, Rajesh

Hi Team,

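Can you please clarify the below? My understanding is that the tokenizer is 
used to say how the content should be indexed physically in the file system, 
and filters are used to query results. The lines below are from my setup. But I 
have seen examples that include filters inside the index analyzer and a 
tokenizer in the query analyzer, which confused me.

<fieldType ... positionIncrementGap="100" >
  <analyzer type="index">
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
  </analyzer>
  <analyzer type="query">
    <filter ... minGramSize="2" maxGramSize="2"/>
  </analyzer>
</fieldType>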

My goal is to use Solr to find the best match among the technology names, e.g.
actual tech names:

1.   Microsoft Visual Studio

2.   Microsoft Internet Explorer

3.   Microsoft Visio

When the user types "Microsoft Visal Studio", the user should get Microsoft 
Visual Studio. Basically, misspelled and jumbled words should match the closest 
tech name.

Re: FW: Difference Between Tokenizer and filter

2016-03-02 Thread Emir Arnautovic

Hi Rajesh,
The processing flow is the same for both indexing and querying. What is
compared at the end are the resulting tokens. In general the flow is:
text -> char filter -> filtered text -> tokenizer -> tokens -> filter1 ->
tokens ... -> filterN -> tokens.

You can read more about the analysis chain in the Solr wiki:
https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers,+Tokenizers,+and+Filters

Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


RE: Tokenizer and Filter Factory to index Chinese characters

2015-07-07 Thread Markus Jelsma

Yes, but it is a small change :)
M.

 
 
-Original message-
 From:Zheng Lin Edwin Yeo edwinye...@gmail.com
 Sent: Tuesday 7th July 2015 4:50
 To: solr-user@lucene.apache.org
 Subject: Re: Tokenizer and Filter Factory to index Chinese characters
 
 So we have to recompile the analysers ourselves before we can use it in 5.x?
 
 Regards,
 Edwin
 
 On 6 July 2015 at 18:44, Markus Jelsma markus.jel...@openindex.io wrote:
 
  Yes, analyzers slightly changed since 5.x.
  https://issues.apache.org/jira/browse/LUCENE-5388
 
  -Original message-
   From:Zheng Lin Edwin Yeo edwinye...@gmail.com
   Sent: Monday 6th July 2015 12:31
   To: solr-user@lucene.apache.org
   Subject: Re: Tokenizer and Filter Factory to index Chinese characters
  
   Yes, I tried that also, but I faced some compatibility issues with Solr
   5.2.1, as the libs that I found and downloaded seem to be for Solr 3.x
   versions.
  
   I got the following error when I tried to start Solr with Paoding
   configured:
  
   java.lang.VerifyError: class net.paoding.analysis.analyzer.PaodingAnalyzerBean overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
     at java.lang.ClassLoader.defineClass1(Native Method)
     at java.lang.ClassLoader.defineClass(Unknown Source)
     at java.security.SecureClassLoader.defineClass(Unknown Source)
     at java.net.URLClassLoader.defineClass(Unknown Source)
     at java.net.URLClassLoader.access$100(Unknown Source)
     at java.net.URLClassLoader$1.run(Unknown Source)
     at java.net.URLClassLoader$1.run(Unknown Source)
     at java.security.AccessController.doPrivileged(Native Method)
     at java.net.URLClassLoader.findClass(Unknown Source)
     at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421)
     at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:383)
     at java.lang.ClassLoader.defineClass1(Native Method)
     at java.lang.ClassLoader.defineClass(Unknown Source)
     at java.security.SecureClassLoader.defineClass(Unknown Source)
     at java.net.URLClassLoader.defineClass(Unknown Source)
     at java.net.URLClassLoader.access$100(Unknown Source)
     at java.net.URLClassLoader$1.run(Unknown Source)
     at java.net.URLClassLoader$1.run(Unknown Source)
     at java.security.AccessController.doPrivileged(Native Method)
     at java.net.URLClassLoader.findClass(Unknown Source)
     at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421)
     at java.lang.ClassLoader.loadClass(Unknown Source)
     at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
     at java.lang.ClassLoader.loadClass(Unknown Source)
     at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
     at java.lang.ClassLoader.loadClass(Unknown Source)
     at java.lang.Class.forName0(Native Method)
     at java.lang.Class.forName(Unknown Source)
     at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:476)
     at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:423)
     at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:262)
     at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:94)
     at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:42)
     at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
     at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:489)
     at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:175)
     at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
     at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
     at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:102)
     at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:74)
     at org.apache.solr.core.CoreContainer.create(CoreContainer.java:516)
     at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:283)
     at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:277)
     at java.util.concurrent.FutureTask.run(Unknown Source)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
     at java.lang.Thread.run(Unknown Source)
  
  
  
   Regards,
   Edwin
  
  
   2015-07-06 16:37 GMT+08:00 davidphilip cherian 
  davidphilipcher...@gmail.com
   :
  
Hi Edwin,
   
Have you tried the Paoding analyzer?  It is not shipped out of the box with
the Solr jars. You may have to download it and add it to the Solr libs.
   
https

Re: Tokenizer and Filter Factory to index Chinese characters

2015-07-06 Thread Zheng Lin Edwin Yeo

 characters directly into the URL, the
  results I get are wrong.
 
  http://localhost:8983/solr/chinese2/select?q=胡姬花&hl=true&hl.fl=text

  "highlighting":{
    "chinese1":{
      "text":["1月份的制造业产值同比仅增长0 \n \n   新加坡 我国1月份的制造业产值同比仅增长<em>0.9</em>％。
  虽然制造业结束连续两个月的萎缩，但比经济师普遍预估的增长<em>3.3</em>％疲软得多。这也意味着，我国今年第一季度的经济很可能让人失望 \n
  "]},
    "chinese2":{
      "text":["Zheng <em>Lin</em> <em>Yeo</em>"]},
    "chinese3":{
      "text":["Zheng <em>Lin</em> <em>Yeo</em>"]},
    "chinese4":{
      "text":["户只要订购《联合晚报》任一种配套，就可选择下列其中一项赠品带回家。 \n 签订两年配套的读者可获得一台价值
  <em>199</em>元的Lenovo <em>TAB</em> 2
  A7-10七寸平板电脑，或者一架价值<em>249</em>元的Philips
  Viva"]},
    "chinese5":{
      "text":["Zheng <em>Lin</em> <em>Yeo</em>"]}}}
 
 
 
  Why is this so?
 
 
  Regards,
 
  Edwin
 
 
  2015-06-25 18:54 GMT+08:00 Markus Jelsma markus.jel...@openindex.io:
 
   You may also want to try Paoding if you have enough time to spend:
   https://github.com/cslinmiso/paoding-analysis
  
   -Original message-
From:Zheng Lin Edwin Yeo edwinye...@gmail.com
Sent: Thursday 25th June 2015 11:38
To: solr-user@lucene.apache.org
Subject: Re: Tokenizer and Filter Factory to index Chinese characters
   
Hi, the result doesn't seem that good either. But you're not using the
HMMChineseTokenizerFactory?
   
The output below is from the filters you've shown me.
   
  "highlighting":{
    "chinese1":{
      "id":["chinese1"],
      "title":["<em>我国</em>1<em>月份的制造业产值同比仅增长</em>0"],
      "content":["，<em>但比经济师普遍预估的增长</em>3.3％<em>疲软得多</em>。<em>这也意味着</em>，<em>我国今年第一季度的经济很可能让人失望</em>
\n  "],
      "author":["<em>Edwin</em>"]},
    "chinese2":{
      "id":["chinese2"],
      "content":["<em>铜牌</em>，<em>让我国暂时高居奖牌荣誉榜榜首</em>。
<em>你看好新加坡在本届的东运会中</em>，<em>会夺得多少面金牌</em>？
<em>请在</em>6月<em>12</em><em>日中午前</em>，<em>投票并留言为我国健将寄上祝语吧</em>  \n
"],
      "author":["<em>Edwin</em>"]},
    "chinese3":{
      "id":["chinese3"],
      "content":[")<em>组成的我国女队在今天的东运会保龄球女子三人赛中</em>，
<em>以六局</em>3963<em>总瓶分夺冠</em>，<em>为新加坡赢得本届赛会第三枚金牌</em>。<em>队友陈诗桦</em>（Jazreel)、<em>梁蕙芬和陈诗静以</em>3707<em>总瓶分获得亚军</em>，<em>季军归菲律宾女队</em>。（<em>联合早报记者</em>：<em>郭嘉惠</em>)
\n  "],
      "author":["<em>Edwin</em>"]},
    "chinese4":{
      "id":["chinese4"],
      "content":["，<em>则可获得一架价值</em>309<em>元的</em>Philips Viva
Collection HD9045<em>面包机</em>。 \n
<em>欲订从速</em>，<em>读者可登陆</em>www.wbsub.com.sg，<em>或拨打客服专线</em>6319
1800<em>订购</em>。 \n
<em>此外</em>，<em>一年一度的晚报保健美容展</em>，<em>将在本月</em><em>23</em><em>日和</em><em>24</em>日，<em>在新达新加坡会展中心</em>401、402<em>展厅举行</em>。
\n
<em>现场将开设</em>《<em>联合晚报</em>》<em>订阅展摊</em>，<em>读者当场订阅晚报</em>，<em>除了可获得丰厚的赠品</em>，<em>还有机会参与</em>“<em>必胜</em>”<em>幸运抽奖</em>"],
      "author":["<em>Edwin</em>"]}}}
   
   
Regards,
Edwin
   
   
2015-06-25 17:28 GMT+08:00 Markus Jelsma markus.jel...@openindex.io
 :
   
 Hi - we are actually using some other filters for Chinese, although
   they
 are not specialized for Chinese:

 tokenizer class=solr.StandardTokenizerFactory/
 filter class=solr.CJKWidthFilterFactory/
 filter class=solr.LowerCaseFilterFactory/
 filter class=solr.CJKBigramFilterFactory/


 -Original message-
  From:Zheng Lin Edwin Yeo edwinye...@gmail.com
  Sent: Thursday 25th June 2015 11:24
  To: solr-user@lucene.apache.org
  Subject: Re: Tokenizer and Filter Factory to index Chinese
  characters
 
  Thank you.
 
  I've tried that, but when I do a search, it's returning much more
  highlighted results that what it supposed to.
 
  For example, if I enter the following query:
  http://localhost:8983/solr/chinese1/highlight?q=我国
 
  I get the following results:
 
  highlighting:{
  chinese1:{
id:[chinese1],
 

  
 
 title:[em我国/em1em月份/em的制造业em产值/emem同比/em仅em增长/em0],
 

  
 
 content:[em结束/emem连续/em两个月的em萎缩/em，但比经济师em普遍/emem预估/em的em增长/em3.3％em疲软/em得多。这也意味着，em我国/emem今年/emem第一/emem季度/em的em经济/em很em可能/em让人em失望/em
  \n  ],
author:[emEdwin/em]},
  chinese2:{
id:[chinese2],
 

  
 
 content:[em铜牌/em，让em我国/emem暂时/emem高居/emem奖牌/emem荣誉/em榜em榜首/em。
  你看好新加坡在本届的东运会中，会em夺得/emem多少/em面em金牌/em？
 

  
 
 请在6月em12/em日em中午/em前，em投票/em并em留言/em为em我国/emem健将/em寄上em祝语/em吧
   \n  ],
author:[emEdwin/em]},
  chinese3:{
id:[chinese3],
 

  
 
 content:[)em组成/em的em我国/emem女队/em在em今天/em的东运会保龄球em女子/em三人赛中，
 

  
 
 以六局3963总瓶分em夺冠/em，为新加坡em赢得/emem本届/emem赛会/em第三枚em金牌/em。em队友/em陈诗桦（Jazreel)、梁蕙芬和陈诗静以3707总瓶分em获得/emem亚军/em，em季军/em归菲律宾em女队/em。（em联合/emem早报/emem记者/em：郭嘉惠)
  \n  ],
author:[Edwin]},
  chinese4:{
id:[chinese4],
 

  
 
 content:[em配套/em的em读者/em，则可em获得/em一架em价值/em309元的Philips
  Viva Collection emHD/em9045面em包机/em。 \n
  欲订从速，em读者/em可em登陆/emwww.wbsub.com
 .emsg/em，或拨打客服em专线/em6319
  1800em订购/em。 \n
 

  
 
 em此外/em，一年一度的em晚报/emem保健/emem美容/em展，将在em本月/emem23/em日和em24/em日，在新达新加坡em会展/emem中心/em401、402em展厅/emem举行/em。
  \n

  
 
 em现场/em将em开设/em《em联合/emem晚报/em》em订阅/em展摊，em读者/emem当场/emem订阅

RE: Tokenizer and Filter Factory to index Chinese characters

2015-07-06 Thread Markus Jelsma

Yes, the analyzer API changed slightly in 5.x:
https://issues.apache.org/jira/browse/LUCENE-5388
 
-Original message-
 From:Zheng Lin Edwin Yeo edwinye...@gmail.com
 Sent: Monday 6th July 2015 12:31
 To: solr-user@lucene.apache.org
 Subject: Re: Tokenizer and Filter Factory to index Chinese characters
 
 Yes, I tried that as well, but I ran into compatibility issues with Solr
 5.2.1, as the libs that I found and downloaded seem to be for Solr 3.x
 versions.
 
 I got the following error when I tried to start Solr with Paoding
 configured:
 
 java.lang.VerifyError: class net.paoding.analysis.analyzer.PaodingAnalyzerBean overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
   at java.lang.ClassLoader.defineClass1(Native Method)
   at java.lang.ClassLoader.defineClass(Unknown Source)
   at java.security.SecureClassLoader.defineClass(Unknown Source)
   at java.net.URLClassLoader.defineClass(Unknown Source)
   at java.net.URLClassLoader.access$100(Unknown Source)
   at java.net.URLClassLoader$1.run(Unknown Source)
   at java.net.URLClassLoader$1.run(Unknown Source)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(Unknown Source)
   at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421)
   at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:383)
   at java.lang.ClassLoader.defineClass1(Native Method)
   at java.lang.ClassLoader.defineClass(Unknown Source)
   at java.security.SecureClassLoader.defineClass(Unknown Source)
   at java.net.URLClassLoader.defineClass(Unknown Source)
   at java.net.URLClassLoader.access$100(Unknown Source)
   at java.net.URLClassLoader$1.run(Unknown Source)
   at java.net.URLClassLoader$1.run(Unknown Source)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(Unknown Source)
   at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421)
   at java.lang.ClassLoader.loadClass(Unknown Source)
   at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
   at java.lang.ClassLoader.loadClass(Unknown Source)
   at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
   at java.lang.ClassLoader.loadClass(Unknown Source)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Unknown Source)
   at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:476)
   at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:423)
   at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:262)
   at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:94)
   at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:42)
   at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
   at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:489)
   at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:175)
   at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
   at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
   at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:102)
   at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:74)
   at org.apache.solr.core.CoreContainer.create(CoreContainer.java:516)
   at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:283)
   at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:277)
   at java.util.concurrent.FutureTask.run(Unknown Source)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
   at java.lang.Thread.run(Unknown Source)
 
 
 
 Regards,
 Edwin
 
 

Re: Tokenizer and Filter Factory to index Chinese characters

2015-07-06 Thread davidphilip cherian

Hi Edwin,

Have you tried the Paoding analyzer? It does not ship with Solr out of the
box, so you may have to download it and add it to Solr's lib directory.

https://stanbol.apache.org/docs/trunk/components/enhancer/nlp/paoding




Re: Tokenizer and Filter Factory to index Chinese characters

2015-07-06 Thread Zheng Lin Edwin Yeo

So we have to recompile the analyzers ourselves before we can use them in 5.x?

Regards,
Edwin

On 6 July 2015 at 18:44, Markus Jelsma markus.jel...@openindex.io wrote:

 Yes, analyzers slightly changed since 5.x.
 https://issues.apache.org/jira/browse/LUCENE-5388


Re: Tokenizer and Filter Factory to index Chinese characters

2015-07-06 Thread Zheng Lin Edwin Yeo

I'm now using solr.ICUTokenizerFactory, and searching for Chinese
characters works when I use the Query tab in the Solr Admin UI.

The Admin UI percent-encodes the Chinese characters before passing them in
the URL, so it looks something like this:
http://localhost:8983/solr/chinese2/select?q=%E8%83%A1%E5%A7%AC%E8%8A%B1&wt=json&indent=true&hl=true

"highlighting":{
  "chinese5":{
    "text":["园将办系列活动庆祝入遗 \n 从<em>胡姬花</em>展到音乐会，为庆祝申遗成功，植物园这个月起将举办一系列活动与公众一同庆贺。 本月10日开始的“新加坡植物园<em>胡姬</em>及其文化遗产”展览，将展出1万6000株<em>胡姬花</em>，这是"]},
  "chinese3":{
    "text":[" \n 原版为 马来语 《Majulah Singapura》，中文译为《 前  进吧，新加坡 》。 \n  \n \t  国花 \n 新加坡以一种名为 卓  锦  ·  万代  兰 的<em>胡姬花</em>为国花。东南亚通称兰花为<em>胡姬花</em>"]}}}
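
For reference, the encoded form of the query term can be reproduced outside the Admin UI. This is only a minimal sketch in Python, assuming the term is encoded as UTF-8 (which matches the bytes in the URL above); the host, core name and parameters are simply the ones from this message, and the script only builds the URL without sending any request.

```python
from urllib.parse import quote, urlencode

term = "胡姬花"

# Percent-encode the UTF-8 bytes of the query term.
print(quote(term))  # %E8%83%A1%E5%A7%AC%E8%8A%B1

# Build the whole query string; urlencode joins the parameters with "&".
params = {"q": term, "wt": "json", "indent": "true", "hl": "true"}
url = "http://localhost:8983/solr/chinese2/select?" + urlencode(params)
print(url)
```

Building the query string from a dict also keeps the `&` separators between parameters explicit, which are easy to lose when a URL is typed or pasted by hand.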



However, if I enter the Chinese characters directly into the URL, the
results I get are wrong.

http://localhost:8983/solr/chinese2/select?q=胡姬花&hl=true&hl.fl=text


"highlighting":{
  "chinese1":{
    "text":["1月份的制造业产值同比仅增长0 \n \n   新加坡 我国1月份的制造业产值同比仅增长<em>0.9</em>％。
虽然制造业结束连续两个月的萎缩，但比经济师普遍预估的增长<em>3.3</em>％疲软得多。这也意味着，我国今年第一季度的经济很可能让人失望 \n
"]},
  "chinese2":{
    "text":["Zheng <em>Lin</em> <em>Yeo</em>"]},
  "chinese3":{
    "text":["Zheng <em>Lin</em> <em>Yeo</em>"]},
  "chinese4":{
    "text":["户只要订购《联合晚报》任一种配套，就可选择下列其中一项赠品带回家。 \n 签订两年配套的读者可获得一台价值
<em>199</em>元的Lenovo <em>TAB</em> 2 A7-10七寸平板电脑，或者一架价值<em>249</em>元的Philips
Viva"]},
  "chinese5":{
    "text":["Zheng <em>Lin</em> <em>Yeo</em>"]}}}



Why is this so?


Regards,

Edwin



RE: Tokenizer and Filter Factory to index Chinese characters

2015-06-25 Thread Markus Jelsma

You may also want to try Paoding if you have enough time to spend:
https://github.com/cslinmiso/paoding-analysis
 
-Original message-
 From:Zheng Lin Edwin Yeo edwinye...@gmail.com
 Sent: Thursday 25th June 2015 11:38
 To: solr-user@lucene.apache.org
 Subject: Re: Tokenizer and Filter Factory to index Chinese characters
 
 Hi, the results don't seem that good either. But you're not using the
 HMMChineseTokenizerFactory?
 
 The output below is from the filters you've shown me.
 
 "highlighting":{
   "chinese1":{
     "id":["chinese1"],
     "title":["<em>我国</em>1<em>月份的制造业产值同比仅增长</em>0"],
     "content":["，<em>但比经济师普遍预估的增长</em>3.3％<em>疲软得多</em>。<em>这也意味着</em>，<em>我国今年第一季度的经济很可能让人失望</em>
 \n  "],
     "author":["<em>Edwin</em>"]},
   "chinese2":{
     "id":["chinese2"],
     "content":["<em>铜牌</em>，<em>让我国暂时高居奖牌荣誉榜榜首</em>。
 <em>你看好新加坡在本届的东运会中</em>，<em>会夺得多少面金牌</em>？
 <em>请在</em>6月<em>12</em><em>日中午前</em>，<em>投票并留言为我国健将寄上祝语吧</em>  \n
 "],
     "author":["<em>Edwin</em>"]},
   "chinese3":{
     "id":["chinese3"],
     "content":[")<em>组成的我国女队在今天的东运会保龄球女子三人赛中</em>，
 <em>以六局</em>3963<em>总瓶分夺冠</em>，<em>为新加坡赢得本届赛会第三枚金牌</em>。<em>队友陈诗桦</em>（Jazreel)、<em>梁蕙芬和陈诗静以</em>3707<em>总瓶分获得亚军</em>，<em>季军归菲律宾女队</em>。（<em>联合早报记者</em>：<em>郭嘉惠</em>)
 \n  "],
     "author":["<em>Edwin</em>"]},
   "chinese4":{
     "id":["chinese4"],
     "content":["，<em>则可获得一架价值</em>309<em>元的</em>Philips Viva
 Collection HD9045<em>面包机</em>。 \n
 <em>欲订从速</em>，<em>读者可登陆</em>www.wbsub.com.sg，<em>或拨打客服专线</em>6319
 1800<em>订购</em>。 \n
 <em>此外</em>，<em>一年一度的晚报保健美容展</em>，<em>将在本月</em><em>23</em><em>日和</em><em>24</em>日，<em>在新达新加坡会展中心</em>401、402<em>展厅举行</em>。
 \n
 <em>现场将开设</em>《<em>联合晚报</em>》<em>订阅展摊</em>，<em>读者当场订阅晚报</em>，<em>除了可获得丰厚的赠品</em>，<em>还有机会参与</em>“<em>必胜</em>”<em>幸运抽奖</em>"],
     "author":["<em>Edwin</em>"]}}}
 
 
 Regards,
 Edwin
 
 

Tokenizer and Filter Factory to index Chinese characters

2015-06-25 Thread Zheng Lin Edwin Yeo

Hi,

Does anyone know the correct replacements for these two tokenizer and
filter factories for indexing Chinese in Solr?
- SmartChineseSentenceTokenizerFactory
- SmartChineseWordTokenFilterFactory

I understand that both factories are already deprecated in Solr 5.1, but I
can't seem to find the correct replacement.


<fieldType name="text_smartcn" class="solr.TextField" positionIncrementGap="0">
  <analyzer type="index">
    <tokenizer class="org.apache.lucene.analysis.cn.smart.SmartChineseSentenceTokenizerFactory"/>
    <filter class="org.apache.lucene.analysis.cn.smart.SmartChineseWordTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.apache.lucene.analysis.cn.smart.SmartChineseSentenceTokenizerFactory"/>
    <filter class="org.apache.lucene.analysis.cn.smart.SmartChineseWordTokenFilterFactory"/>
  </analyzer>
</fieldType>

Thank you.


Regards,
Edwin

Re: Tokenizer and Filter Factory to index Chinese characters

2015-06-25 Thread Zheng Lin Edwin Yeo

Thank you.

I've tried that, but when I do a search, it returns many more highlighted
terms than it is supposed to.

For example, if I enter the following query:
http://localhost:8983/solr/chinese1/highlight?q=我国

I get the following results:

"highlighting":{
  "chinese1":{
    "id":["chinese1"],
    "title":["<em>我国</em>1<em>月份</em>的制造业<em>产值</em><em>同比</em>仅<em>增长</em>0"],
    "content":["<em>结束</em><em>连续</em>两个月的<em>萎缩</em>，但比经济师<em>普遍</em><em>预估</em>的<em>增长</em>3.3％<em>疲软</em>得多。这也意味着，<em>我国</em><em>今年</em><em>第一</em><em>季度</em>的<em>经济</em>很<em>可能</em>让人<em>失望</em>
\n  "],
    "author":["<em>Edwin</em>"]},
  "chinese2":{
    "id":["chinese2"],
    "content":["<em>铜牌</em>，让<em>我国</em><em>暂时</em><em>高居</em><em>奖牌</em><em>荣誉</em>榜<em>榜首</em>。
你看好新加坡在本届的东运会中，会<em>夺得</em><em>多少</em>面<em>金牌</em>？
请在6月<em>12</em>日<em>中午</em>前，<em>投票</em>并<em>留言</em>为<em>我国</em><em>健将</em>寄上<em>祝语</em>吧
 \n  "],
    "author":["<em>Edwin</em>"]},
  "chinese3":{
    "id":["chinese3"],
    "content":[")<em>组成</em>的<em>我国</em><em>女队</em>在<em>今天</em>的东运会保龄球<em>女子</em>三人赛中，
以六局3963总瓶分<em>夺冠</em>，为新加坡<em>赢得</em><em>本届</em><em>赛会</em>第三枚<em>金牌</em>。<em>队友</em>陈诗桦（Jazreel)、梁蕙芬和陈诗静以3707总瓶分<em>获得</em><em>亚军</em>，<em>季军</em>归菲律宾<em>女队</em>。（<em>联合</em><em>早报</em><em>记者</em>：郭嘉惠)
\n  "],
    "author":["Edwin"]},
  "chinese4":{
    "id":["chinese4"],
    "content":["<em>配套</em>的<em>读者</em>，则可<em>获得</em>一架<em>价值</em>309元的Philips
Viva Collection <em>HD</em>9045面<em>包机</em>。 \n
欲订从速，<em>读者</em>可<em>登陆</em>www.wbsub.com.<em>sg</em>，或拨打客服<em>专线</em>6319
1800<em>订购</em>。 \n
<em>此外</em>，一年一度的<em>晚报</em><em>保健</em><em>美容</em>展，将在<em>本月</em><em>23</em>日和<em>24</em>日，在新达新加坡<em>会展</em><em>中心</em>401、402<em>展厅</em><em>举行</em>。
\n
<em>现场</em>将<em>开设</em>《<em>联合</em><em>晚报</em>》<em>订阅</em>展摊，<em>读者</em><em>当场</em><em>订阅</em><em>晚报</em>，<em>除了</em>可<em>获得</em><em>丰厚</em>的<em>赠品</em>，还有<em>机会</em><em>参与</em>“"],
    "author":["<em>Edwin</em>"]}}}


Is there any suitable filter factory to solve this issue?

I've tried WordDelimiterFilterFactory, PorterStemFilterFactory
and StopFilterFactory, but there's no improvement in the search results.


Regards,
Edwin


On 25 June 2015 at 17:17, Markus Jelsma markus.jel...@openindex.io wrote:

 Hello - you can use HMMChineseTokenizerFactory instead.

 http://lucene.apache.org/core/5_2_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizerFactory.html


RE: Tokenizer and Filter Factory to index Chinese characters

2015-06-25 Thread Markus Jelsma

Hi - we are actually using some other filters for Chinese, although they are 
not specialized for Chinese:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
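
For anyone unfamiliar with the last filter in that chain: as far as I understand it, CJKBigramFilterFactory joins adjacent CJK characters into overlapping two-character tokens. Below is a rough Python sketch of that idea only. The names `is_han` and `cjk_bigrams` are made up for illustration, the Han check covers just the basic CJK Unified Ideographs block, non-Han characters are simply dropped here, and this is not the Lucene implementation (which also tracks positions and has further options).

```python
def is_han(ch):
    """True for characters in the basic CJK Unified Ideographs block (an assumption)."""
    return "\u4e00" <= ch <= "\u9fff"


def cjk_bigrams(text):
    """Split runs of Han characters into overlapping bigrams (illustrative only)."""
    tokens = []
    run = []

    def flush():
        # A single isolated character is kept as-is; longer runs become bigrams.
        if len(run) == 1:
            tokens.append(run[0])
        else:
            tokens.extend(a + b for a, b in zip(run, run[1:]))
        run.clear()

    for ch in text:
        if is_han(ch):
            run.append(ch)
        else:
            flush()  # a non-Han character ends the current run
    flush()
    return tokens


print(cjk_bigrams("我国今年第一季度"))
# ['我国', '国今', '今年', '年第', '第一', '一季', '季度']
```

Because the bigrams overlap, a two-character query such as 我国 can match without any dictionary-based word segmentation, at the cost of a larger index and less precise matches.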
 
 
-Original message-
 From:Zheng Lin Edwin Yeo edwinye...@gmail.com
 Sent: Thursday 25th June 2015 11:24
 To: solr-user@lucene.apache.org
 Subject: Re: Tokenizer and Filter Factory to index Chinese characters
 
 Thank you.
 
 I've tried that, but when I do a search, it's returning much more
 highlighted results that what it supposed to.
 
 For example, if I enter the following query:
 http://localhost:8983/solr/chinese1/highlight?q=我国
 
 I get the following results:
 
 highlighting:{
 chinese1:{
   id:[chinese1],
   
 title:[em我国/em1em月份/em的制造业em产值/emem同比/em仅em增长/em0],
   
 content:[em结束/emem连续/em两个月的em萎缩/em，但比经济师em普遍/emem预估/em的em增长/em3.3％em疲软/em得多。这也意味着，em我国/emem今年/emem第一/emem季度/em的em经济/em很em可能/em让人em失望/em
 \n  ],
   author:[emEdwin/em]},
 chinese2:{
   id:[chinese2],
   
 content:[em铜牌/em，让em我国/emem暂时/emem高居/emem奖牌/emem荣誉/em榜em榜首/em。
 你看好新加坡在本届的东运会中，会em夺得/emem多少/em面em金牌/em？
 请在6月em12/em日em中午/em前，em投票/em并em留言/em为em我国/emem健将/em寄上em祝语/em吧
  \n  ],
   author:[emEdwin/em]},
 chinese3:{
   id:[chinese3],
   
 content:[)em组成/em的em我国/emem女队/em在em今天/em的东运会保龄球em女子/em三人赛中，
 以六局3963总瓶分em夺冠/em，为新加坡em赢得/emem本届/emem赛会/em第三枚em金牌/em。em队友/em陈诗桦（Jazreel)、梁蕙芬和陈诗静以3707总瓶分em获得/emem亚军/em，em季军/em归菲律宾em女队/em。（em联合/emem早报/emem记者/em：郭嘉惠)
 \n  ],
   author:[Edwin]},
 chinese4:{
   id:[chinese4],
   
 content:[em配套/em的em读者/em，则可em获得/em一架em价值/em309元的Philips
 Viva Collection emHD/em9045面em包机/em。 \n
 欲订从速，em读者/em可em登陆/emwww.wbsub.com.emsg/em，或拨打客服em专线/em6319
 1800em订购/em。 \n
 em此外/em，一年一度的em晚报/emem保健/emem美容/em展，将在em本月/emem23/em日和em24/em日，在新达新加坡em会展/emem中心/em401、402em展厅/emem举行/em。
 \n 
 em现场/em将em开设/em《em联合/emem晚报/em》em订阅/em展摊，em读者/emem当场/emem订阅/emem晚报/em，em除了/em可em获得/emem丰厚/em的em赠品/em，还有em机会/emem参与/em“],
   author:[emEdwin/em]}}}
 
 
 Is there any suitable filter factory to solve this issue?
 
 I've tried WordDelimiterFilterFactory, PorterStemFilterFactory
 and StopFilterFactory, but there's no improvement in the search results.
 
 
 Regards,
 Edwin
 
 

RE: Tokenizer and Filter Factory to index Chinese characters

2015-06-25 Thread Markus Jelsma

Hello - you can use HMMChineseTokenizerFactory instead.
http://lucene.apache.org/core/5_2_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizerFactory.html

-Original message-
 From:Zheng Lin Edwin Yeo edwinye...@gmail.com
 Sent: Thursday 25th June 2015 11:02
 To: solr-user@lucene.apache.org
 Subject: Tokenizer and Filter Factory to index Chinese characters
 
 Hi,
 
 Does anyone know the correct replacement for these two factories, a
 tokenizer and a filter, for indexing Chinese into Solr?
 - SmartChineseSentenceTokenizerFactory
 - SmartChineseWordTokenFilterFactory
 
 I understand that both factories are already deprecated in Solr 5.1,
 but I can't seem to find the correct replacement.
 
 
 <fieldType name="text_smartcn" class="solr.TextField"
            positionIncrementGap="0">
   <analyzer type="index">
     <tokenizer class="org.apache.lucene.analysis.cn.smart.SmartChineseSentenceTokenizerFactory"/>
     <filter class="org.apache.lucene.analysis.cn.smart.SmartChineseWordTokenFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="org.apache.lucene.analysis.cn.smart.SmartChineseSentenceTokenizerFactory"/>
     <filter class="org.apache.lucene.analysis.cn.smart.SmartChineseWordTokenFilterFactory"/>
   </analyzer>
 </fieldType>
 
 Thank you.
 
 
 Regards,
 Edwin

Re: Tokenizer and Filter Factory to index Chinese characters

2015-06-25 Thread Zheng Lin Edwin Yeo

Hi, the result doesn't seem that good either. So you're not using
HMMChineseTokenizerFactory?

The output below is from the filters you've shown me.

  highlighting:{
chinese1:{
  id:[chinese1],
  title:[em我国/em1em月份的制造业产值同比仅增长/em0],
  
content:[，em但比经济师普遍预估的增长/em3.3％em疲软得多/em。em这也意味着/em，em我国今年第一季度的经济很可能让人失望/em
\n  ],
  author:[emEdwin/em]},
chinese2:{
  id:[chinese2],
  content:[em铜牌/em，em让我国暂时高居奖牌荣誉榜榜首/em。
em你看好新加坡在本届的东运会中/em，em会夺得多少面金牌/em？
em请在/em6月em12/emem日中午前/em，em投票并留言为我国健将寄上祝语吧/em  \n
],
  author:[emEdwin/em]},
chinese3:{
  id:[chinese3],
  content:[)em组成的我国女队在今天的东运会保龄球女子三人赛中/em，
em以六局/em3963em总瓶分夺冠/em，em为新加坡赢得本届赛会第三枚金牌/em。em队友陈诗桦/em（Jazreel)、em梁蕙芬和陈诗静以/em3707em总瓶分获得亚军/em，em季军归菲律宾女队/em。（em联合早报记者/em：em郭嘉惠/em)
\n  ],
  author:[emEdwin/em]},
chinese4:{
  id:[chinese4],
  content:[，em则可获得一架价值/em309em元的/emPhilips Viva
Collection HD9045em面包机/em。 \n
em欲订从速/em，em读者可登陆/emwww.wbsub.com.sg，em或拨打客服专线/em6319
1800em订购/em。 \n
em此外/em，em一年一度的晚报保健美容展/em，em将在本月/emem23/emem日和/emem24/em日，em在新达新加坡会展中心/em401、402em展厅举行/em。
\n 
em现场将开设/em《em联合晚报/em》em订阅展摊/em，em读者当场订阅晚报/em，em除了可获得丰厚的赠品/em，em还有机会参与/em“em必胜/em”em幸运抽奖/em],
  author:[emEdwin/em]}}}


Regards,
Edwin



Re: Tokenizer or Filter ?

2015-01-14 Thread Jack Krupansky

It's what Java has, whatever that is:
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

So, maybe the correct answer is neither, but similar to both.

-- Jack Krupansky

On Wed, Jan 14, 2015 at 9:06 AM, tomas.kalas kala...@email.cz wrote:

 Oh yeah, that is it. Thank you very much for your patience. And a last
 question at the end what type regEx Solr actually using ? POSIX or PCRE ?
 Thanks.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346p4179505.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: Tokenizer or Filter ?

2015-01-14 Thread Jack Krupansky

It should replace all occurrences of the pattern. Post your specific filter
XML. Patterns can be very tricky.

Use the Solr Admin UI analysis page to see how the filtering is occurring.

-- Jack Krupansky


Re: Tokenizer or Filter ?

2015-01-14 Thread tomas.kalas

I have only tested this on the Analysis page of the Solr admin UI; or do I
have to index the data first?

I used this XML in my schema:

<fieldType name="direction1" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="&lt;d1&gt;.*&lt;/d1&gt;" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

This is my result:
http://lucene.472066.n3.nabble.com/file/n4179496/dir1.png 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346p4179496.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Tokenizer or Filter ?

2015-01-14 Thread Jack Krupansky

I suspected it might do that: the pattern is greedy and takes the longest
possible match. Add a question mark after the asterisk (.*? instead of .*)
to use reluctant (non-greedy) mode, which takes the shortest match.
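
Since Solr hands these patterns to java.util.regex, the difference can be reproduced in plain Java. The sketch below is only an illustration: the class and method names are hypothetical, and the input is the sample string from earlier in the thread with its angle brackets written out.

```java
public class GreedyVsReluctant {

    static final String INPUT =
            "<d1>text d1</d1><d2>text d2</d2><d1>text d1</d1><d2>text 2 ok</d2>";

    // Deletes every match of the given regex from the sample input.
    static String strip(String regex) {
        return INPUT.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // Greedy: one match from the first <d1> to the last </d1>.
        // prints <d2>text 2 ok</d2>
        System.out.println(strip("<d1>.*</d1>"));

        // Reluctant: each <d1>...</d1> segment is matched separately.
        // prints <d2>text d2</d2><d2>text 2 ok</d2>
        System.out.println(strip("<d1>.*?</d1>"));
    }
}
```

In the schema XML the same pattern still needs its angle brackets escaped as &lt; and &gt;; only the quantifier changes.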

-- Jack Krupansky


Re: Tokenizer or Filter ?

2015-01-14 Thread tomas.kalas

Jack, thanks for the help. But when I use PatternReplaceCharFilterFactory
on, for example, this input:
<d1>text d1</d1><d2>text d2</d2><d1>text d1</d1><d2>text 2 ok</d2>
the output contains only the segment <d2>text 2 ok</d2>, even though
<d2>text d2</d2> sits between the markers, as in
<d1>...</d1><d2>...</d2><d1>...</d1>. So the filter apparently matches from
the first <d1> to the last </d1>, and anything in between is not skipped but
is replaced by a space as well (I set the replacement to a space). So
wouldn't it be better to use the update processor? If you have described it
well in your book, I will buy it.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346p4179477.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Tokenizer or Filter ?

2015-01-14 Thread tomas.kalas

Oh yeah, that is it. Thank you very much for your patience. One last
question: which regex flavor does Solr actually use, POSIX or PCRE?
Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346p4179505.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Tokenizer or Filter ?

2015-01-13 Thread Jack Krupansky

Actually, you may be able to get by using PatternReplaceCharFilterFactory:
copy the source value to two fields, one that treats <d2>.*</d2> as the
delimiter pattern to delete and the other that uses <d1>.*</d1> as the
delimiter pattern to delete, so the first field has only d1 and the
second has only d2. You can use a second pattern char filter to remove
the </?d[12]> markers as well, probably changing them to a space in both
cases.
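
The two-field idea can be tried outside Solr with plain java.util.regex. This is a rough sketch under stated assumptions: the class and method names are hypothetical, the final whitespace cleanup is added only to make the result readable, and the quantifier is the reluctant .*? rather than the .* written above, because a greedy pattern would swallow everything between the first opening tag and the last closing tag.

```java
public class TwoFieldSketch {

    // Simulates one field's two char filters: first delete the other
    // channel's segments, then turn the remaining markers into spaces.
    static String keepOnly(String source, String dropTag) {
        String withoutOther =
                source.replaceAll("<" + dropTag + ">.*?</" + dropTag + ">", " ");
        String withoutMarkers = withoutOther.replaceAll("</?d[12]>", " ");
        // Whitespace cleanup for display only (not part of the suggestion).
        return withoutMarkers.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String source = "<d1>Hello</d1><d2>Hello</d2>"
                + "<d1>How are you ?</d1><d2>Fine and you're?</d2>";

        // prints Hello How are you ?
        System.out.println(keepOnly(source, "d2"));
        // prints Hello Fine and you're?
        System.out.println(keepOnly(source, "d1"));
    }
}
```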

See:
http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilterFactory.html

-- Jack Krupansky


Re: Tokenizer or Filter ?

2015-01-13 Thread tomas.kalas

Thanks, Jack, for your advice. Can you please explain a little more about
how it works? The Apache wiki is not too clear to me. Can I write some
JavaScript code when I want to filter some data? In this case I have
<d1>bla bla bla</d1> <d2> bla bla bla </d2> <d1>bla bla bla </d1> and I want
to filter <d2> bla bla bla </d2>, but in another case I want to filter all
<d1> </d1> segments. So I suppose I would apply it to the indexed data and
filter from that? Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346p4179173.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Tokenizer or Filter ?

2015-01-13 Thread Jack Krupansky

Would it be sufficient for your use case to simply extract all the d1
into one field and all the d2 into another field? If so, the update
processor script would be very simple: match every <d1>.*</d1> and copy
the matches to a separate field value, and do the same for d2.
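
The matching logic such a script would need can be sketched in plain Java (a real stateless script update processor would be written in JavaScript and would read and write fields on the Solr document, which this sketch does not attempt). The class and method names are hypothetical, and the pattern uses a reluctant quantifier plus a back-reference so that each segment is matched on its own.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChannelSplitSketch {

    // Group 1 is the channel name, group 2 the text; \1 requires the
    // closing tag to match the opening one.
    private static final Pattern SEGMENT =
            Pattern.compile("<(d[12])>(.*?)</\\1>");

    // Collects the text of every segment, grouped by channel.
    static Map<String, List<String>> split(String source) {
        Map<String, List<String>> byChannel = new LinkedHashMap<>();
        byChannel.put("d1", new ArrayList<>());
        byChannel.put("d2", new ArrayList<>());
        Matcher m = SEGMENT.matcher(source);
        while (m.find()) {
            byChannel.get(m.group(1)).add(m.group(2).trim());
        }
        return byChannel;
    }

    public static void main(String[] args) {
        String source = "<d1>Hello</d1><d2>Hello</d2>"
                + "<d1>How are you ?</d1><d2>Fine and you're?</d2>";
        // prints {d1=[Hello, How are you ?], d2=[Hello, Fine and you're?]}
        System.out.println(split(source));
    }
}
```

Each list would then become the values of one multi-valued field, for example one field per direction.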

If you want examples of script update processors, see my Solr e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

-- Jack Krupansky


Re: Tokenizer or Filter ?

2015-01-09 Thread tomas.kalas

I used the same regex and unfortunately it doesn't work. Or should I
change the regex somehow? Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346p4178389.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Tokenizer or Filter ?

2015-01-09 Thread Jack Krupansky

Consider an update processor - it can take any input, break it up any way
you want, and then output multiple field values.

You can even use the stateless script update processor to write the logic in
JavaScript.

-- Jack Krupansky


Tokenizer or Filter ?

2015-01-09 Thread tomas.kalas

Hello, I have a question: should I use a tokenizer or a filter?
I need to separate two channels. I wrote about this here earlier, but I have
realized that it is probably not possible with Solr's basic tools, so I am
trying to write my own tool for this task.
I have this input:
<d1>Hello</d1><d2>Hello</d2><d1>How are you ?</d1><d2>Fine and you're?</d2>
d1 - direction1
d2 - direction2
and I want to output only d1 and then search for some words within that
result. For example, the output should be:
Output: [<d1>Hello</d1>,<d1>How are you?</d1><d1></d1>]

I wrote my idea in Java, but I don't know where to incorporate it: in a
filter or in a tokenizer? Any advice on how to start? I probably have to
extend some Lucene class and plug in a slightly modified version, don't I?

Here is my code:

package test1;

import java.util.Arrays;

public class Test1 {

    public static void main(String[] args) {
        String dialogue = "<d1>Hello</d1><d2>Hello</d2><d1>How are you ?</d1>"
                + "<d2>Fine and you're?</d2>";

        // Split at the zero-width boundary between a closing tag and the
        // next opening tag, so every element is one complete segment.
        String[] input = dialogue.split("(?<=</d[12]>)\\d*(?=<d[12]>)");

        // First pass: count the d1 segments to size the result array.
        int countD1 = 0;
        for (String input1 : input) {
            if (input1.startsWith("<d1>")) {
                countD1++;
            }
        }

        // Second pass: copy the d1 segments into the result array.
        String[] d1 = new String[countD1];
        int array = 0;
        for (String input1 : input) {
            if (input1.startsWith("<d1>")) {
                d1[array] = input1;
                array++;
            }
        }
        String d1Out = Arrays.toString(d1);
        System.out.println(d1Out);
        // Return d1Out
    }
}

Thanks for your advice.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Tokenizer or Filter ?

2015-01-09 Thread Ahmet Arslan

Can't you use solr.PatternTokenizerFactory for this task?



