Re: FW: Difference Between Tokenizer and filter
Try re-reading the doc on "Understanding Analyzers, Tokenizers, and Filters" and then ask specific questions on specific statements made in the doc:
https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers,+Tokenizers,+and+Filters

As far as on-disk format, a Solr user has absolutely zero reason to be concerned about what format Lucene uses to store the index on disk. You are certainly welcome to dive down to that level if you wish, but that is not something worth discussing on this list. To a Solr user the index is simply a list of terms at positions, both determined by the character filters, tokenizer, and token filters of the analyzer. The format of that information as stored in Lucene won't impact the behavior of your Solr app in any way.

Again, to be clear, you need to be thoroughly familiar with that doc section. It won't help you to try to guess questions to ask if you don't have a basic understanding of what is stated on that doc page.

It might also help you visualize what the doc says by using the analysis page of the Solr admin UI which will give you all the intermediate and final results of the analysis process, the specific token/term text and position at each step. But even that won't help if you are unable to grasp what is stated on the basic doc page.

-- Jack Krupansky

On Thu, Mar 3, 2016 at 8:51 AM, G, Rajesh <r...@cebglobal.com> wrote:
> Hi Shawn,
>
> One last question on analyzer. If the format of the index on disk is not
> controlled by the tokenizer, or anything else in the analysis chain, then
> what does type="index" and type="query" in analyzer mean. Can you please
> help me in understanding?
>
> Corporate Executive Board India Private Limited. Registration No:
> U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF Building
> No.10 DLF Cyber City, Gurgaon, Haryana-122002, India.
>
> This e-mail and/or its attachments are intended only for the use of the
> addressee(s) and may contain confidential and legally privileged
> information belonging to CEB and/or its subsidiaries, including CEB
> subsidiaries that offer SHL Talent Measurement products and services. If
> you have received this e-mail in error, please notify the sender and
> immediately, destroy all copies of this email and its attachments. The
> publication, copying, in whole or in part, or use or dissemination in any
> other way of this e-mail and attachments by anyone other than the intended
> person(s) is prohibited.
RE: FW: Difference Between Tokenizer and filter
The "index" type analyzer is used when documents are indexed and determines what tokens end up in the index. The "query" type analyzer is used to analyze the user query and determines what tokens will be searched for.

As an example: if you want to be able to match on synonyms, you could have a "query" type analyzer that replaces each token in the user's query with the list of corresponding synonyms. The "index" type analyzer should just index the tokens as they are. (If you have a fixed list of synonyms, both could map each token to a pre-defined 'canonical' synonym and save both index and query time.)

Luc
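The index/query split described in this reply can be sketched in a few lines of Python. This is only an illustration of the idea, not Solr code: the synonym table, the whitespace tokenizer, and the function names are all assumptions made up for the example.

```python
# Hypothetical synonym table; a real Solr setup would configure this differently.
SYNONYMS = {
    "tv": ["tv", "television"],
    "television": ["television", "tv"],
}


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real tokenizer)."""
    return text.lower().split()


def index_analyzer(text):
    """Plays the role of type="index": tokens are stored as they are."""
    return tokenize(text)


def query_analyzer(text):
    """Plays the role of type="query": each token becomes its synonym list."""
    expanded = []
    for token in tokenize(text):
        expanded.extend(SYNONYMS.get(token, [token]))
    return expanded


def matches(document, query):
    """A document matches if any query token is among its indexed tokens."""
    indexed = set(index_analyzer(document))
    return any(token in indexed for token in query_analyzer(query))


if __name__ == "__main__":
    print(index_analyzer("Cheap television for sale"))
    print(query_analyzer("TV"))
    print(matches("Cheap television for sale", "TV"))
```

The document is indexed without any expansion, and the query "TV" still finds it because the expansion happens on the query side only.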
RE: FW: Difference Between Tokenizer and filter
Hi Shawn,

One last question on analyzers. If the format of the index on disk is not controlled by the tokenizer, or anything else in the analysis chain, then what do type="index" and type="query" in the analyzer mean? Can you please help me understand?
RE: FW: Difference Between Tokenizer and filter
Thanks Shawn. This helps
Re: FW: Difference Between Tokenizer and filter
On 3/2/2016 9:55 AM, G, Rajesh wrote:
> Thanks for your email Koji. Can you please explain what is the role of
> tokenizer and filter so I can understand why I should not have two tokenizer
> in index and I should have at least one tokenizer in query?

You can't have two tokenizers. It's not allowed.

The only notable difference between a Tokenizer and a Filter is that a Tokenizer operates on an input that's a single string, turning it into a token stream, and a Filter uses a token stream for both input and output. A CharFilter uses a single string as both input and output.

An analysis chain in the Solr schema (whether it's index or query) is composed of zero or more CharFilter entries, exactly one Tokenizer entry, and zero or more Filter entries. Alternately, you can specify an Analyzer class, which is a lot like a Tokenizer. An Analyzer is effectively the same thing as a tokenizer combined with filters.

CharFilters run before the Tokenizer, and Filters run after the Tokenizer. CharFilters, Tokenizers, Filters, and Analyzers are Lucene concepts.

> My understanding is tokenizer is used to say how the content should be
> indexed physically in file system. Filters are used to query result

The format of the index on disk is not controlled by the tokenizer, or anything else in the analysis chain. It is controlled by the Lucene codec. Only a very small part of the codec is configurable in Solr, but normally this does not need configuring. The codec defaults are appropriate for the majority of use cases.

Thanks,
Shawn
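The composition rule in the message above (zero or more CharFilters, exactly one Tokenizer, zero or more Filters) can be modelled as a small Python class. This is a rough sketch under assumptions: the `Analyzer` class here and the helper functions are invented for illustration and are not the Lucene or Solr API.

```python
import re


class Analyzer:
    """Toy analysis chain: char filters, then one tokenizer, then token filters."""

    def __init__(self, tokenizers, char_filters=(), token_filters=()):
        # The rule from the message: exactly one tokenizer per chain.
        if len(tokenizers) != 1:
            raise ValueError(
                "an analysis chain needs exactly one tokenizer, got %d"
                % len(tokenizers)
            )
        self.tokenizer = tokenizers[0]
        self.char_filters = list(char_filters)
        self.token_filters = list(token_filters)

    def analyze(self, text):
        for char_filter in self.char_filters:    # string in, string out
            text = char_filter(text)
        tokens = self.tokenizer(text)            # string in, tokens out
        for token_filter in self.token_filters:  # tokens in, tokens out
            tokens = token_filter(tokens)
        return tokens


def strip_html(text):
    """Char filter: replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text):
    """Tokenizer: split the string on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


STOP_WORDS = {"the", "a"}  # hypothetical stop word list


def stop_filter(tokens):
    """Token filter: drop stop words."""
    return [token for token in tokens if token not in STOP_WORDS]


if __name__ == "__main__":
    chain = Analyzer(
        tokenizers=[whitespace_tokenizer],
        char_filters=[strip_html],
        token_filters=[lowercase_filter, stop_filter],
    )
    print(chain.analyze("<b>The</b> Quick Fox"))
```

Passing two tokenizers, or none, raises a `ValueError`, which mirrors the reason a schema with several tokenizers in one analyzer is rejected.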
RE: FW: Difference Between Tokenizer and filter
Thanks for your email Koji. Can you please explain what the role of the tokenizer and the filter is, so I can understand why I should not have two tokenizers in index and why I should have at least one tokenizer in query?

My understanding is that the tokenizer is used to say how the content should be indexed physically in the file system. Filters are used to query the result.

-Original Message-
From: Koji Sekiguchi [mailto:koji.sekigu...@rondhuit.com]
Sent: Wednesday, March 2, 2016 8:10 PM
To: solr-user@lucene.apache.org
Subject: Re: FW: Difference Between Tokenizer and filter

Hi,

<analyzer>...</analyzer> must have one and only one <tokenizer> and it can have zero or more <filter>s. From the point of view of the rules, your <analyzer type="index">...</analyzer> is not correct because it has more than one <tokenizer> and <analyzer type="query">...</analyzer> is not correct as well because it has no <tokenizer>.

Koji

On 2016/03/02 20:25, G, Rajesh wrote:
> Hi Team,
>
> Can you please clarify the below. My understanding is tokenizer is used to
> say how the content should be indexed physically in file system. Filters are
> used to query result. The below lines are from my setup. But I have seen
> examples that include filters inside ... and tokenizer in ... that confused me.
>
> <fieldType ... positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.LowerCaseTokenizerFactory"/>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
>   </analyzer>
>   <analyzer type="query">
>     <... minGramSize="2" maxGramSize="2"/>
>   </analyzer>
> </fieldType>
>
> My goal is to use Solr and find the best match among the technology
> names e.g Actual tech name
>
> 1. Microsoft Visual Studio
>
> 2. Microsoft Internet Explorer
>
> 3. Microsoft Visio
>
> When user types Microsoft Visal Studio user should get Microsoft
> Visual Studio. Basically misspelled and jumble words should match
> closest tech name
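To see why 2-grams (minGramSize="2", maxGramSize="2" in the quoted setup) help with the stated goal of matching misspelled or jumbled names, here is a small Python sketch. It is an illustration only, with loud assumptions: grams are taken per word rather than over the whole string, and names are ranked by the share of grams they have in common with the query, which is not how Solr scores documents. The function names are made up for the example.

```python
TECH_NAMES = [
    "Microsoft Visual Studio",
    "Microsoft Internet Explorer",
    "Microsoft Visio",
]


def bigrams(text):
    """Return the set of 2-character grams of each lowercased word."""
    grams = set()
    for word in text.lower().split():
        for i in range(len(word) - 1):
            grams.add(word[i:i + 2])
    return grams


def similarity(a, b):
    """Share of grams the two texts have in common (shared / total distinct)."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def best_match(query, names=TECH_NAMES):
    """Pick the name whose bigram set is closest to the query's."""
    return max(names, key=lambda name: similarity(query, name))


if __name__ == "__main__":
    print(best_match("Microsoft Visal Studio"))   # misspelled
    print(best_match("Studio Visual Microsoft"))  # jumbled word order
```

Because a single wrong letter only disturbs the two or three grams around it, the misspelled query still shares most of its grams with the intended name, and word order does not matter at all in this simplified model.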
Re: FW: Difference Between Tokenizer and filter
Hi,

<analyzer>...</analyzer> must have one and only one <tokenizer> and it can have zero or more <filter>s. From the point of view of the rules, your <analyzer type="index">...</analyzer> is not correct because it has more than one <tokenizer> and <analyzer type="query">...</analyzer> is not correct as well because it has no <tokenizer>.

Koji
FW: Difference Between Tokenizer and filter
Hi Team,

Can you please clarify the below. My understanding is that the tokenizer is used to say how the content should be indexed physically in the file system. Filters are used to query the result. The lines below are from my setup. But I have seen examples that include filters inside ... and a tokenizer in ..., and that confused me.

My goal is to use Solr to find the best match among the technology names, e.g.

Actual tech names:
1. Microsoft Visual Studio
2. Microsoft Internet Explorer
3. Microsoft Visio

When the user types "Microsoft Visal Studio", the user should get Microsoft Visual Studio. Basically, misspelled and jumbled words should match the closest tech name.
Re: FW: Difference Between Tokenizer and filter
Hi Rajesh,

The processing flow is the same for both indexing and querying. What is compared at the end are the resulting tokens. In general the flow is:

text -> char filter -> filtered text -> tokenizer -> tokens -> filter1 -> tokens ... -> filterN -> tokens

You can read more about the analysis chain in the Solr wiki:
https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers,+Tokenizers,+and+Filters

Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/
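The flow in this reply can be traced step by step in Python, which is roughly what the analysis page of the Solr admin UI shows for a real field type. The sketch below is an assumption-laden stand-in: `trace`, `dashes_to_spaces`, `split_whitespace`, and `lowercase` are hypothetical helpers, not Solr components.

```python
def dashes_to_spaces(text):
    """Char filter stand-in: string in, string out."""
    return text.replace("-", " ")


def split_whitespace(text):
    """Tokenizer stand-in: string in, list of tokens out."""
    return text.split()


def lowercase(tokens):
    """Token filter stand-in: tokens in, tokens out."""
    return [token.lower() for token in tokens]


def trace(text, char_filters, tokenizer, token_filters):
    """Run the chain and record the output of every stage."""
    stages = [("text", text)]
    for char_filter in char_filters:
        text = char_filter(text)
        stages.append((char_filter.__name__, text))
    tokens = tokenizer(text)
    stages.append((tokenizer.__name__, list(tokens)))
    for token_filter in token_filters:
        tokens = token_filter(tokens)
        stages.append((token_filter.__name__, list(tokens)))
    return stages


if __name__ == "__main__":
    chain = ([dashes_to_spaces], split_whitespace, [lowercase])
    indexed = trace("Internet-Explorer", *chain)
    queried = trace("internet explorer", *chain)
    for name, output in indexed:
        print(name, "->", output)
    # What gets compared in the end are the final tokens of both sides.
    print(indexed[-1][1] == queried[-1][1])
```

The indexed text and the query look different as strings, but after the same chain both end in the same tokens, which is the comparison the message refers to.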
RE: Tokenizer and Filter Factory to index Chinese characters
Yes, but it is a small change :)

M.

-Original message-
From: Zheng Lin Edwin Yeo edwinye...@gmail.com
Sent: Tuesday 7th July 2015 4:50
To: solr-user@lucene.apache.org
Subject: Re: Tokenizer and Filter Factory to index Chinese characters

So we have to recompile the analysers ourselves before we can use it in 5.x?

Regards,
Edwin

On 6 July 2015 at 18:44, Markus Jelsma markus.jel...@openindex.io wrote:

Yes, analyzers slightly changed since 5.x.
https://issues.apache.org/jira/browse/LUCENE-5388

-Original message-
From: Zheng Lin Edwin Yeo edwinye...@gmail.com
Sent: Monday 6th July 2015 12:31
To: solr-user@lucene.apache.org
Subject: Re: Tokenizer and Filter Factory to index Chinese characters

Yes, I tried that also, but I faced some compatibility issues with Solr 5.2.1, as the libs that I found and downloaded seem to be for Solr 3.x versions.

I got the following error when I tried to start Solr with Paoding configured:

java.lang.VerifyError: class net.paoding.analysis.analyzer.PaodingAnalyzerBean overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(Unknown Source)
    at java.security.SecureClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.access$100(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421)
    at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:383)
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(Unknown Source)
    at java.security.SecureClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.access$100(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Unknown Source)
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:476)
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:423)
    at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:262)
    at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:94)
    at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:42)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:489)
    at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:175)
    at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
    at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
    at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:102)
    at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:74)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:516)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:283)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:277)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

Regards,
Edwin

2015-07-06 16:37 GMT+08:00 davidphilip cherian davidphilipcher...@gmail.com:

Hi Edwin,

Have you tried the Paoding analyzer? It is not out of the box shipped with Solr jars. You may have to download it and add it to solr libs. https
Re: Tokenizer and Filter Factory to index Chinese characters
characters directly into the URL, the results I get are wrong.

http://localhost:8983/solr/chinese2/select?q=胡姬花&hl=true&hl.fl=text

highlighting:{
  chinese1:{
    text:[1月份的制造业产值同比仅增长0 \n \n 新加坡 我国1月份的制造业产值同比仅增长<em>0.9</em>%。 虽然制造业结束连续两个月的萎缩,但比经济师普遍预估的增长<em>3.3</em>%疲软得多。这也意味着,我国今年第一季度的经济很可能让人失望 \n ]},
  chinese2:{
    text:[Zheng <em>Lin</em> <em>Yeo</em>]},
  chinese3:{
    text:[Zheng <em>Lin</em> <em>Yeo</em>]},
  chinese4:{
    text:[户只要订购《联合晚报》任一种配套,就可选择下列其中一项赠品带回家。 \n 签订两年配套的读者可获得一台价值 <em>199</em>元的Lenovo <em>TAB</em> 2 A7-10七寸平板电脑,或者一架价值<em>249</em>元的Philips Viva]},
  chinese5:{
    text:[Zheng <em>Lin</em> <em>Yeo</em>]}}}

Why is this so?

Regards,
Edwin

2015-06-25 18:54 GMT+08:00 Markus Jelsma markus.jel...@openindex.io:

You may also want to try Paoding if you have enough time to spend:
https://github.com/cslinmiso/paoding-analysis

-Original message-
From: Zheng Lin Edwin Yeo edwinye...@gmail.com
Sent: Thursday 25th June 2015 11:38
To: solr-user@lucene.apache.org
Subject: Re: Tokenizer and Filter Factory to index Chinese characters

Hi,

The result doesn't seem that good either. But you're not using the HMMChineseTokenizerFactory?

The output below is from the filters you've shown me.

highlighting:{
  chinese1:{
    id:[chinese1],
    title:[<em>我国</em>1<em>月份的制造业产值同比仅增长</em>0],
    content:[,<em>但比经济师普遍预估的增长</em>3.3%<em>疲软得多</em>。<em>这也意味着</em>,<em>我国今年第一季度的经济很可能让人失望</em> \n ],
    author:[<em>Edwin</em>]},
  chinese2:{
    id:[chinese2],
    content:[<em>铜牌</em>,<em>让我国暂时高居奖牌荣誉榜榜首</em>。 <em>你看好新加坡在本届的东运会中</em>,<em>会夺得多少面金牌</em>? <em>请在</em>6月<em>12</em><em>日中午前</em>,<em>投票并留言为我国健将寄上祝语吧</em> \n ],
    author:[<em>Edwin</em>]},
  chinese3:{
    id:[chinese3],
    content:[)<em>组成的我国女队在今天的东运会保龄球女子三人赛中</em>, <em>以六局</em>3963<em>总瓶分夺冠</em>,<em>为新加坡赢得本届赛会第三枚金牌</em>。<em>队友陈诗桦</em>(Jazreel)、<em>梁蕙芬和陈诗静以</em>3707<em>总瓶分获得亚军</em>,<em>季军归菲律宾女队</em>。(<em>联合早报记者</em>:<em>郭嘉惠</em>) \n ],
    author:[<em>Edwin</em>]},
  chinese4:{
    id:[chinese4],
    content:[,<em>则可获得一架价值</em>309<em>元的</em>Philips Viva Collection HD9045<em>面包机</em>。 \n <em>欲订从速</em>,<em>读者可登陆</em>www.wbsub.com.sg,<em>或拨打客服专线</em>6319 1800<em>订购</em>。 \n <em>此外</em>,<em>一年一度的晚报保健美容展</em>,<em>将在本月</em><em>23</em><em>日和</em><em>24</em>日,<em>在新达新加坡会展中心</em>401、402<em>展厅举行</em>。 \n <em>现场将开设</em>《<em>联合晚报</em>》<em>订阅展摊</em>,<em>读者当场订阅晚报</em>,<em>除了可获得丰厚的赠品</em>,<em>还有机会参与</em>“<em>必胜</em>”<em>幸运抽奖</em>],
    author:[<em>Edwin</em>]}}}

Regards,
Edwin

2015-06-25 17:28 GMT+08:00 Markus Jelsma markus.jel...@openindex.io:

Hi - we are actually using some other filters for Chinese, although they are not specialized for Chinese:

    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>

-Original message-
From: Zheng Lin Edwin Yeo edwinye...@gmail.com
Sent: Thursday 25th June 2015 11:24
To: solr-user@lucene.apache.org
Subject: Re: Tokenizer and Filter Factory to index Chinese characters

Thank you. I've tried that, but when I do a search, it's returning many more highlighted results than it is supposed to.

For example, if I enter the following query:
http://localhost:8983/solr/chinese1/highlight?q=我国

I get the following results:

highlighting:{
  chinese1:{
    id:[chinese1],
    title:[<em>我国</em>1<em>月份</em>的制造业<em>产值</em><em>同比</em>仅<em>增长</em>0],
    content:[<em>结束</em><em>连续</em>两个月的<em>萎缩</em>,但比经济师<em>普遍</em><em>预估</em>的<em>增长</em>3.3%<em>疲软</em>得多。这也意味着,<em>我国</em><em>今年</em><em>第一</em><em>季度</em>的<em>经济</em>很<em>可能</em>让人<em>失望</em> \n ],
    author:[<em>Edwin</em>]},
  chinese2:{
    id:[chinese2],
    content:[<em>铜牌</em>,让<em>我国</em><em>暂时</em><em>高居</em><em>奖牌</em><em>荣誉</em>榜<em>榜首</em>。 你看好新加坡在本届的东运会中,会<em>夺得</em><em>多少</em>面<em>金牌</em>? 请在6月<em>12</em>日<em>中午</em>前,<em>投票</em>并<em>留言</em>为<em>我国</em><em>健将</em>寄上<em>祝语</em>吧 \n ],
    author:[<em>Edwin</em>]},
  chinese3:{
    id:[chinese3],
    content:[)<em>组成</em>的<em>我国</em><em>女队</em>在<em>今天</em>的东运会保龄球<em>女子</em>三人赛中, 以六局3963总瓶分<em>夺冠</em>,为新加坡<em>赢得</em><em>本届</em><em>赛会</em>第三枚<em>金牌</em>。<em>队友</em>陈诗桦(Jazreel)、梁蕙芬和陈诗静以3707总瓶分<em>获得</em><em>亚军</em>,<em>季军</em>归菲律宾<em>女队</em>。(<em>联合</em><em>早报</em><em>记者</em>:郭嘉惠) \n ],
    author:[Edwin]},
  chinese4:{
    id:[chinese4],
    content:[<em>配套</em>的<em>读者</em>,则可<em>获得</em>一架<em>价值</em>309元的Philips Viva Collection <em>HD</em>9045面<em>包机</em>。 \n 欲订从速,<em>读者</em>可<em>登陆</em>www.wbsub.com .<em>sg</em>,或拨打客服<em>专线</em>6319 1800<em>订购</em>。 \n <em>此外</em>,一年一度的<em>晚报</em><em>保健</em><em>美容</em>展,将在<em>本月</em><em>23</em>日和<em>24</em>日,在新达新加坡<em>会展</em><em>中心</em>401、402<em>展厅</em><em>举行</em>。 \n <em>现场</em>将<em>开设</em>《<em>联合</em><em>晚报</em>》<em>订阅</em>展摊,<em>读者</em><em>当场</em><em>订阅
RE: Tokenizer and Filter Factory to index Chinese characters
Yes, analyzers slightly changed since 5.x.

https://issues.apache.org/jira/browse/LUCENE-5388

-----Original message-----
> From: Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> Sent: Monday 6th July 2015 12:31
> To: solr-user@lucene.apache.org
> Subject: Re: Tokenizer and Filter Factory to index Chinese characters
>
> Yes, I tried that also, but I faced some compatibility issues with Solr
> 5.2.1, as the libs that I found and downloaded seems to be for Solr 3.x
> versions.
>
> I got the following error when I tried to start Solr with Paoding
> configured:
>
> java.lang.VerifyError: class net.paoding.analysis.analyzer.PaodingAnalyzerBean overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
>     at java.lang.ClassLoader.defineClass1(Native Method)
>     at java.lang.ClassLoader.defineClass(Unknown Source)
>     at java.security.SecureClassLoader.defineClass(Unknown Source)
>     at java.net.URLClassLoader.defineClass(Unknown Source)
>     at java.net.URLClassLoader.access$100(Unknown Source)
>     at java.net.URLClassLoader$1.run(Unknown Source)
>     at java.net.URLClassLoader$1.run(Unknown Source)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(Unknown Source)
>     at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421)
>     at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:383)
>     at java.lang.ClassLoader.defineClass1(Native Method)
>     at java.lang.ClassLoader.defineClass(Unknown Source)
>     at java.security.SecureClassLoader.defineClass(Unknown Source)
>     at java.net.URLClassLoader.defineClass(Unknown Source)
>     at java.net.URLClassLoader.access$100(Unknown Source)
>     at java.net.URLClassLoader$1.run(Unknown Source)
>     at java.net.URLClassLoader$1.run(Unknown Source)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(Unknown Source)
>     at org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421)
>     at java.lang.ClassLoader.loadClass(Unknown Source)
>     at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
>     at java.lang.ClassLoader.loadClass(Unknown Source)
>     at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
>     at java.lang.ClassLoader.loadClass(Unknown Source)
>     at java.lang.Class.forName0(Native Method)
>     at java.lang.Class.forName(Unknown Source)
>     at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:476)
>     at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:423)
>     at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:262)
>     at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:94)
>     at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:42)
>     at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
>     at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:489)
>     at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:175)
>     at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
>     at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
>     at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:102)
>     at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:74)
>     at org.apache.solr.core.CoreContainer.create(CoreContainer.java:516)
>     at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:283)
>     at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:277)
>     at java.util.concurrent.FutureTask.run(Unknown Source)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.lang.Thread.run(Unknown Source)
>
> Regards,
> Edwin
Re: Tokenizer and Filter Factory to index Chinese characters
Hi Edwin,

Have you tried the Paoding analyzer? It is not shipped out of the box with the Solr jars. You may have to download it and add it to the Solr libs.

https://stanbol.apache.org/docs/trunk/components/enhancer/nlp/paoding
Re: Tokenizer and Filter Factory to index Chinese characters
So we have to recompile the analysers ourselves before we can use them in 5.x?

Regards,
Edwin

On 6 July 2015 at 18:44, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Yes, analyzers slightly changed since 5.x.
>
> https://issues.apache.org/jira/browse/LUCENE-5388
Re: Tokenizer and Filter Factory to index Chinese characters
I'm now using solr.ICUTokenizerFactory, and searching for Chinese characters works when I use the Query tab in the Solr Admin UI.

The Admin UI converts the Chinese characters to percent-encoded form before passing them in the URL, so it looks something like this:

http://localhost:8983/solr/chinese2/select?q=%E8%83%A1%E5%A7%AC%E8%8A%B1&wt=json&indent=true&hl=true

"highlighting":{
  "chinese5":{
    "text":["园将办系列活动庆祝入遗 \n 从<em>胡姬花</em>展到音 乐会,为庆祝申遗成功,植物园这个月起将举办一系列活动与公众一同庆贺。 本月10日开始的“新加坡植物园<em>胡姬</em>及其文化遗产”展览,将展出1万 6000株<em>胡姬花</em>,这是"]},
  "chinese3":{
    "text":[" \n 原版为 马来语 《Majulah Singapura》,中文译为《 前 进吧,新加坡 》。 \n \n \t 国花 \n 新加坡以一种名为 卓 锦 · 万代 兰 的<em>胡姬花</em>为国花。东南亚通称兰花为<em>胡姬花</em>"]}}}

However, if I enter the Chinese characters directly into the URL, the results I get are wrong.

http://localhost:8983/solr/chinese2/select?q=胡姬花&hl=true&hl.fl=text

"highlighting":{
  "chinese1":{
    "text":["1月份的制造业产值同比仅增长0 \n \n 新加坡 我国1月份的制造业产值同比仅增长<em>0.9</em>%。 虽然制造业结束连续两个月的萎缩,但比经济师普遍预估的增长<em>3.3</em>%疲软得多。这也意味着,我国今年第一季度的经济很可能让人失望 \n "]},
  "chinese2":{
    "text":["Zheng <em>Lin</em> <em>Yeo</em>"]},
  "chinese3":{
    "text":["Zheng <em>Lin</em> <em>Yeo</em>"]},
  "chinese4":{
    "text":["户只要订购《联合晚报》任一种配套,就可选择下列其中一项赠品带回家。 \n 签订两年配套的读者可获得一台价值 <em>199</em>元的Lenovo <em>TAB</em> 2 A7-10七寸平板电脑,或者一架价值<em>249</em>元的Philips Viva"]},
  "chinese5":{
    "text":["Zheng <em>Lin</em> <em>Yeo</em>"]}}}

Why is this so?

Regards,
Edwin

2015-06-25 18:54 GMT+08:00 Markus Jelsma <markus.jel...@openindex.io>:

> You may also want to try Paoding if you have enough time to spend:
> https://github.com/cslinmiso/paoding-analysis
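A minimal sketch of a field type built on the solr.ICUTokenizerFactory that Edwin mentions. The field type name text_icu and the two filters after the tokenizer are assumptions for illustration, not taken from this thread, and the ICU analysis jars from the distribution's contrib/analysis-extras directory are assumed to be loaded by the core.

```xml
<!-- Sketch only: name and filter choices are hypothetical -->
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Unicode-aware segmentation, including dictionary-based Chinese word breaks -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Fold full-width Latin and half-width Katakana variants -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A single analyzer element without a type attribute is used for both indexing and querying, so the same segmentation applies on both sides.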
RE: Tokenizer and Filter Factory to index Chinese characters
You may also want to try Paoding if you have enough time to spend:

https://github.com/cslinmiso/paoding-analysis
Tokenizer and Filter Factory to index Chinese characters
Hi,

Does anyone know the correct replacement for these two factories for indexing Chinese in Solr?

- SmartChineseSentenceTokenizerFactory
- SmartChineseWordTokenFilterFactory

I understand that both of them are already deprecated in Solr 5.1, but I can't seem to find the correct replacement.

<fieldType name="text_smartcn" class="solr.TextField" positionIncrementGap="0">
  <analyzer type="index">
    <tokenizer class="org.apache.lucene.analysis.cn.smart.SmartChineseSentenceTokenizerFactory"/>
    <filter class="org.apache.lucene.analysis.cn.smart.SmartChineseWordTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.apache.lucene.analysis.cn.smart.SmartChineseSentenceTokenizerFactory"/>
    <filter class="org.apache.lucene.analysis.cn.smart.SmartChineseWordTokenFilterFactory"/>
  </analyzer>
</fieldType>

Thank you.

Regards,
Edwin
Re: Tokenizer and Filter Factory to index Chinese characters
Thank you. I've tried that, but when I do a search, it returns many more highlighted results than it is supposed to.

For example, if I enter the following query:

http://localhost:8983/solr/chinese1/highlight?q=我国

I get the following results:

"highlighting":{
  "chinese1":{
    "id":["chinese1"],
    "title":["<em>我国</em>1<em>月份</em>的制造业<em>产值</em><em>同比</em>仅<em>增长</em>0"],
    "content":["<em>结束</em><em>连续</em>两个月的<em>萎缩</em>,但比经济师<em>普遍</em><em>预估</em>的<em>增长</em>3.3%<em>疲软</em>得多。这也意味着,<em>我国</em><em>今年</em><em>第一</em><em>季度</em>的<em>经济</em>很<em>可能</em>让人<em>失望</em> \n "],
    "author":["<em>Edwin</em>"]},
  "chinese2":{
    "id":["chinese2"],
    "content":["<em>铜牌</em>,让<em>我国</em><em>暂时</em><em>高居</em><em>奖牌</em><em>荣誉</em>榜<em>榜首</em>。 你看好新加坡在本届的东运会中,会<em>夺得</em><em>多少</em>面<em>金牌</em>? 请在6月<em>12</em>日<em>中午</em>前,<em>投票</em>并<em>留言</em>为<em>我国</em><em>健将</em>寄上<em>祝语</em>吧 \n "],
    "author":["<em>Edwin</em>"]},
  "chinese3":{
    "id":["chinese3"],
    "content":[")<em>组成</em>的<em>我国</em><em>女队</em>在<em>今天</em>的东运会保龄球<em>女子</em>三人赛中, 以六局3963总瓶分<em>夺冠</em>,为新加坡<em>赢得</em><em>本届</em><em>赛会</em>第三枚<em>金牌</em>。<em>队友</em>陈诗桦(Jazreel)、梁蕙芬和陈诗静以3707总瓶分<em>获得</em><em>亚军</em>,<em>季军</em>归菲律宾<em>女队</em>。(<em>联合</em><em>早报</em><em>记者</em>:郭嘉惠) \n "],
    "author":["Edwin"]},
  "chinese4":{
    "id":["chinese4"],
    "content":["<em>配套</em>的<em>读者</em>,则可<em>获得</em>一架<em>价值</em>309元的Philips Viva Collection <em>HD</em>9045面<em>包机</em>。 \n 欲订从速,<em>读者</em>可<em>登陆</em>www.wbsub.com.<em>sg</em>,或拨打客服<em>专线</em>6319 1800<em>订购</em>。 \n <em>此外</em>,一年一度的<em>晚报</em><em>保健</em><em>美容</em>展,将在<em>本月</em><em>23</em>日和<em>24</em>日,在新达新加坡<em>会展</em><em>中心</em>401、402<em>展厅</em><em>举行</em>。 \n <em>现场</em>将<em>开设</em>《<em>联合</em><em>晚报</em>》<em>订阅</em>展摊,<em>读者</em><em>当场</em><em>订阅</em><em>晚报</em>,<em>除了</em>可<em>获得</em><em>丰厚</em>的<em>赠品</em>,还有<em>机会</em><em>参与</em>“"],
    "author":["<em>Edwin</em>"]}}}

Is there any suitable filter factory to solve this issue? I've tried WordDelimiterFilterFactory, PorterStemFilterFactory and StopFilterFactory, but there's no improvement in the search results.

Regards,
Edwin

On 25 June 2015 at 17:17, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hello - you can use HMMChineseTokenizerFactory instead.
>
> http://lucene.apache.org/core/5_2_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizerFactory.html
RE: Tokenizer and Filter Factory to index Chinese characters
Hi - we are actually using some other filters for Chinese, although they are not specialized for Chinese:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
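As a sketch, the filter chain Markus lists could sit in a field type like the one below. The name text_cjk and the positionIncrementGap value are assumptions, and the attributes on CJKBigramFilterFactory are spelled out only to show what can be tuned; the values shown are what I believe the defaults to be.

```xml
<!-- Sketch only: field type name and gap are hypothetical -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- StandardTokenizer emits each Han character as its own token -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Folds full-width Latin and half-width Katakana variants -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Joins adjacent CJK characters into overlapping bigrams;
         outputUnigrams="true" would also keep the single characters -->
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
            katakana="true" hangul="true" outputUnigrams="false"/>
  </analyzer>
</fieldType>
```

With this chain a run such as 我国今年 is indexed as the overlapping pairs 我国, 国今 and 今年, so matching is based on character pairs rather than dictionary words.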
RE: Tokenizer and Filter Factory to index Chinese characters
Hello - you can use HMMChineseTokenizerFactory instead.

http://lucene.apache.org/core/5_2_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/HMMChineseTokenizerFactory.html
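For reference, a minimal sketch of how the text_smartcn field type from the original question might look with the suggested tokenizer swapped in. The field type name and positionIncrementGap are kept from the question; the two filters after the tokenizer are an assumption rather than something stated in this thread, and the smartcn analyzer jar (found under contrib/analysis-extras in the Solr distribution) is assumed to be on the core's classpath.

```xml
<!-- Sketch only: the tokenizer replaces the deprecated sentence tokenizer
     plus word token filter pair, so no separate word filter is listed. -->
<fieldType name="text_smartcn" class="solr.TextField" positionIncrementGap="0">
  <analyzer>
    <tokenizer class="org.apache.lucene.analysis.cn.smart.HMMChineseTokenizerFactory"/>
    <!-- Assumed extras: fold full-width forms and lowercase Latin text -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the analyzer element has no type attribute, the same chain is applied at index time and at query time, which mirrors the identical index and query analyzers in the original configuration.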
Re: Tokenizer and Filter Factory to index Chinese characters
Hi, The result doesn't seems that good as well. But you're not using the HMMChineseTokenizerFactory? The output below is from the filters you've shown me. highlighting:{ chinese1:{ id:[chinese1], title:[em我国/em1em月份的制造业产值同比仅增长/em0], content:[,em但比经济师普遍预估的增长/em3.3%em疲软得多/em。em这也意味着/em,em我国今年第一季度的经济很可能让人失望/em \n ], author:[emEdwin/em]}, chinese2:{ id:[chinese2], content:[em铜牌/em,em让我国暂时高居奖牌荣誉榜榜首/em。 em你看好新加坡在本届的东运会中/em,em会夺得多少面金牌/em? em请在/em6月em12/emem日中午前/em,em投票并留言为我国健将寄上祝语吧/em \n ], author:[emEdwin/em]}, chinese3:{ id:[chinese3], content:[)em组成的我国女队在今天的东运会保龄球女子三人赛中/em, em以六局/em3963em总瓶分夺冠/em,em为新加坡赢得本届赛会第三枚金牌/em。em队友陈诗桦/em(Jazreel)、em梁蕙芬和陈诗静以/em3707em总瓶分获得亚军/em,em季军归菲律宾女队/em。(em联合早报记者/em:em郭嘉惠/em) \n ], author:[emEdwin/em]}, chinese4:{ id:[chinese4], content:[,em则可获得一架价值/em309em元的/emPhilips Viva Collection HD9045em面包机/em。 \n em欲订从速/em,em读者可登陆/emwww.wbsub.com.sg,em或拨打客服专线/em6319 1800em订购/em。 \n em此外/em,em一年一度的晚报保健美容展/em,em将在本月/emem23/emem日和/emem24/em日,em在新达新加坡会展中心/em401、402em展厅举行/em。 \n em现场将开设/em《em联合晚报/em》em订阅展摊/em,em读者当场订阅晚报/em,em除了可获得丰厚的赠品/em,em还有机会参与/em“em必胜/em”em幸运抽奖/em], author:[emEdwin/em]}}} Regards, Edwin 2015-06-25 17:28 GMT+08:00 Markus Jelsma markus.jel...@openindex.io: Hi - we are actually using some other filters for Chinese, although they are not specialized for Chinese: tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.CJKWidthFilterFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.CJKBigramFilterFactory/ -Original message- From:Zheng Lin Edwin Yeo edwinye...@gmail.com Sent: Thursday 25th June 2015 11:24 To: solr-user@lucene.apache.org Subject: Re: Tokenizer and Filter Factory to index Chinese characters Thank you. I've tried that, but when I do a search, it's returning much more highlighted results that what it supposed to. 
For example, if I enter the following query: http://localhost:8983/solr/chinese1/highlight?q=我国 I get the following results: highlighting:{ chinese1:{ id:[chinese1], title:[em我国/em1em月份/em的制造业em产值/emem同比/em仅em增长/em0], content:[em结束/emem连续/em两个月的em萎缩/em,但比经济师em普遍/emem预估/em的em增长/em3.3%em疲软/em得多。这也意味着,em我国/emem今年/emem第一/emem季度/em的em经济/em很em可能/em让人em失望/em \n ], author:[emEdwin/em]}, chinese2:{ id:[chinese2], content:[em铜牌/em,让em我国/emem暂时/emem高居/emem奖牌/emem荣誉/em榜em榜首/em。 你看好新加坡在本届的东运会中,会em夺得/emem多少/em面em金牌/em? 请在6月em12/em日em中午/em前,em投票/em并em留言/em为em我国/emem健将/em寄上em祝语/em吧 \n ], author:[emEdwin/em]}, chinese3:{ id:[chinese3], content:[)em组成/em的em我国/emem女队/em在em今天/em的东运会保龄球em女子/em三人赛中, 以六局3963总瓶分em夺冠/em,为新加坡em赢得/emem本届/emem赛会/em第三枚em金牌/em。em队友/em陈诗桦(Jazreel)、梁蕙芬和陈诗静以3707总瓶分em获得/emem亚军/em,em季军/em归菲律宾em女队/em。(em联合/emem早报/emem记者/em:郭嘉惠) \n ], author:[Edwin]}, chinese4:{ id:[chinese4], content:[em配套/em的em读者/em,则可em获得/em一架em价值/em309元的Philips Viva Collection emHD/em9045面em包机/em。 \n 欲订从速,em读者/em可em登陆/emwww.wbsub.com .emsg/em,或拨打客服em专线/em6319 1800em订购/em。 \n em此外/em,一年一度的em晚报/emem保健/emem美容/em展,将在em本月/emem23/em日和em24/em日,在新达新加坡em会展/emem中心/em401、402em展厅/emem举行/em。 \n em现场/em将em开设/em《em联合/emem晚报/em》em订阅/em展摊,em读者/emem当场/emem订阅/emem晚报/em,em除了/em可em获得/emem丰厚/em的em赠品/em,还有em机会/emem参与/em“], author:[emEdwin/em]}}} Is there any suitable filter factory to solve this issue? I've tried WordDelimiterFilterFactory, PorterStemFilterFactory and StopFilterFactory, but there's no improvement in the search results. Regards, Edwin On 25 June 2015 at 17:17, Markus Jelsma markus.jel...@openindex.io wrote: Hello - you can use HMMChineseTokenizerFactory instead. 
Re: Tokenizer or Filter ?
It's what Java has, whatever that is: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html So, maybe the correct answer is neither, but similar to both. -- Jack Krupansky
Re: Tokenizer or Filter ?
It should replace all occurrences of the pattern. Post your specific filter XML. Patterns can be very tricky. Use the Solr Admin UI analysis page to see how the filtering is occurring. -- Jack Krupansky
Re: Tokenizer or Filter ?
I just used the analysis page of the Solr Admin UI for my test, or must I index the data first? I used this XML in my schema:

<fieldType name="direction1" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="&lt;d1&gt;.*&lt;/d1&gt;" replacement=" "/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

This is my result: http://lucene.472066.n3.nabble.com/file/n4179496/dir1.png -- View this message in context: http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346p4179496.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Tokenizer or Filter ?
I was suspecting it might do that - the pattern is greedy and takes the longest possible match, so it runs from the first <d1> to the last </d1>. Add a question mark after the asterisk (.*?) to make the quantifier reluctant (non-greedy), so that it matches the shortest possible span instead. -- Jack Krupansky
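To make the greedy versus reluctant difference concrete, here is a minimal sketch in plain Java, using java.util.regex directly rather than Solr. It assumes the sample input posted earlier in the thread and a single space as the replacement; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Greedy: ".*" runs from the first <d1> to the LAST </d1> in the input.
    static final Pattern GREEDY = Pattern.compile("<d1>.*</d1>");
    // Reluctant: ".*?" stops at the nearest </d1>, so each block matches separately.
    static final Pattern RELUCTANT = Pattern.compile("<d1>.*?</d1>");

    // Replace every match with a single space, as the char filter was configured to do.
    static String strip(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        String input = "<d1>text d1</d1><d2>text d2</d2><d1>text d1</d1><d2>text 2 ok</d2>";
        // Greedy swallows the first <d2> block too: " <d2>text 2 ok</d2>"
        System.out.println(strip(GREEDY, input));
        // Reluctant keeps both <d2> blocks: " <d2>text d2</d2> <d2>text 2 ok</d2>"
        System.out.println(strip(RELUCTANT, input));
    }
}
```

The greedy result is the same single <d2> segment that was reported from the analysis page, which is consistent with the pattern being the cause rather than the char filter itself.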
Re: Tokenizer or Filter ?
Jack, thanks for the help. But if I use PatternReplaceCharFilterFactory on, for example, this input: <d1>text d1</d1><d2>text d2</d2><d1>text d1</d1><d2>text 2 ok</d2> then the output contains only the segment <d2>text 2 ok</d2>, even though <d2>text d2</d2> sits between the two <d1>...</d1> blocks. So the filter probably matches from the first <d1> to the last </d1>, and anything in between is not skipped but replaced by a space as well (I set the replacement to a space). So wouldn't it be better to use the update processor? If it is described well in your book, then I will buy it. -- View this message in context: http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346p4179477.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Tokenizer or Filter ?
Oh yeah, that is it. Thank you very much for your patience. One last question: what type of regex does Solr actually use, POSIX or PCRE? Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346p4179505.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Tokenizer or Filter ?
Actually, you may be able to get by using PatternReplaceCharFilterFactory - copy the source value to two fields, one that treats <d2>.*</d2> as the delimiter pattern to delete and the other that uses <d1>.*</d1> as the delimiter pattern to delete, so the first field has only d1 and the second has only d2. You can use a second pattern char filter to remove the <[/]?d[12]> markers as well, probably changing them to a space in both cases. See: http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilterFactory.html -- Jack Krupansky
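The two-field idea can be tried outside Solr first. The following is only a sketch in plain Java that imitates two chained pattern-replace steps per field; it is not Solr code. It assumes reluctant quantifiers (.*?) instead of the greedy .* written above, and the final whitespace clean-up is an addition for readability that a tokenizer would normally make unnecessary. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class SplitDirections {
    // Second step: the opening and closing d1/d2 markers themselves.
    static final Pattern MARKERS = Pattern.compile("</?d[12]>");

    // keepTag is "d1" or "d2"; everything wrapped in the other tag is deleted.
    static String keepOnly(String keepTag, String input) {
        String dropTag = keepTag.equals("d1") ? "d2" : "d1";
        // First step: delete whole blocks of the other direction (reluctant match).
        String withoutOther = input.replaceAll("<" + dropTag + ">.*?</" + dropTag + ">", " ");
        // Second step: turn the remaining markers into spaces, then tidy whitespace.
        return MARKERS.matcher(withoutOther).replaceAll(" ").trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String input = "<d1>Hello</d1><d2>Hello</d2><d1>How are you ?</d1><d2>Fine and you're?</d2>";
        System.out.println(keepOnly("d1", input)); // Hello How are you ?
        System.out.println(keepOnly("d2", input)); // Hello Fine and you're?
    }
}
```

In a schema, each of the two fields would carry its own pair of char filters, with the source value copied into both.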
Re: Tokenizer or Filter ?
Thanks, Jack, for your advice. Can you please explain a little more how it works? From the Apache wiki it's not too clear to me. Can I write some JavaScript code when I want to filter some data? In this case I have <d1>bla bla bla</d1> <d2> bla bla bla </d2> <d1>bla bla bla </d1> and I want to filter <d2> bla bla bla </d2>, but in another case I want to filter all <d1> </d1>. So I suppose I apply it to the indexed data and filter from that? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346p4179173.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Tokenizer or Filter ?
Would it be sufficient for your use case to simply extract all the <d1> into one field and all the <d2> into another field? If so, the update processor script would be very simple, simply matching all <d1>.*</d1> and copying them to a separate field value, and the same for <d2>. If you want examples of script update processors, see my Solr e-book: http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html -- Jack Krupansky
Re: Tokenizer or Filter ?
I used the same regex and unfortunately it doesn't work. Or should I change the regex somehow? Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346p4178389.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Tokenizer or Filter ?
Consider an update processor - it can take any input, break it up any way you want, and then output multiple field values. You can even use the stateless script update processor to write the logic in JavaScript. -- Jack Krupansky
Tokenizer or Filter ?
Hello, I have a question: should I use a tokenizer or a filter? I need to separate two channels. I wrote about this here earlier, but realized that it is probably not possible with Solr's basic tools, so I'm trying to write my own tool for this task. I have this input:

<d1>Hello</d1><d2>Hello</d2><d1>How are you ?</d1><d2>Fine and you're?</d2>

d1 - direction1
d2 - direction2

I want to output only d1 and then search for some words within that result. For example, the output should be:

Output: [<d1>Hello</d1>,<d1>How are you?</d1><d1></d1>]

I wrote my idea in Java, but I don't know where to incorporate it: in a Filter or in a Tokenizer? Any advice on how to start? I probably must extend some Lucene class and plug my slightly modified version in there, is that right? Here is my code:

    package test1;

    import java.util.Arrays;

    public class Test1 {
        public static void main(String[] args) {
            String dialogue = "<d1>Hello</d1><d2>Hello</d2><d1>How are you ?</d1><d2>Fine and you're?</d2>";
            // Split at the zero-width boundary between a closing and an opening marker.
            String[] input = dialogue.split("(?<=</d[12]>)\\d*(?=<d[12]>)");
            // First pass: count the d1 segments.
            int countD1 = 0;
            for (String input1 : input) {
                if (input1.startsWith("<d1>")) {
                    countD1++;
                }
            }
            // Second pass: copy the d1 segments into their own array.
            String[] d1 = new String[countD1];
            int array = 0;
            for (String input1 : input) {
                if (input1.startsWith("<d1>")) {
                    d1[array] = input1;
                    array++;
                }
            }
            String d1Out = Arrays.toString(d1);
            System.out.println(d1Out); // return d1Out
        }
    }

Thanks for your advice. -- View this message in context: http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Tokenizer or Filter ?
Can't you use solr.PatternTokenizerFactory for this task?
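For reference, solr.PatternTokenizerFactory takes a pattern and a group attribute, and with a non-negative group it emits the text of that capture group as tokens. The sketch below only imitates that behaviour with java.util.regex so the pattern can be tried before it goes into a schema; the pattern <d1>(.*?)</d1> with group 1 is an assumption about what would suit this input, and the class name is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class D1Extractor {
    // Capture group 1 is the text between <d1> and the nearest </d1>.
    static final Pattern D1 = Pattern.compile("<d1>(.*?)</d1>");

    // Collect group 1 of every match, roughly what a group-based pattern tokenizer would emit.
    static List<String> extract(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = D1.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "<d1>Hello</d1><d2>Hello</d2><d1>How are you ?</d1><d2>Fine and you're?</d2>";
        System.out.println(extract(input)); // [Hello, How are you ?]
    }
}
```

Note that this yields one token per <d1> block, so word-level search inside each block would still need further analysis on top.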