Solr - Remove specific punctuation marks
Hi, I am working with apache-solr-3.6.0 on a Windows machine. I would like to remove all punctuation marks before indexing, except the colon and the full stop. I tried:
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="[\p{Punct}&&[^\.^\:]]" replacement="" replace="all"/>
  </analyzer>
</fieldType>
But it didn't work. Any ideas? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Remove-specific-punctuation-marks-tp4009795.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Solr - Remove specific punctuation marks
Hi Daisy, I can't see anything wrong with the regex or the XML syntax. One possibility: if it's Arabic you're matching against, you may want to add ARABIC FULL STOP U+06D4 to the set you subtract from \p{Punct}. If you give an example of your input and your expected output, I might be able to help more. Steve
RE: Solr - Remove specific punctuation marks
Yes, I am trying to index an Arabic document. The problem is that Solr cannot parse the regex in the schema and returns a 500 error code. Here is an example input: هذا مثال: للتوضيح (مثال علي علامات الترقيم) انتهي. (Roughly: "This is an example: for clarification (an example of punctuation marks) finished.") I also tried the regex pattern="([\(\)\}\{\,[^.:\s+\S+]])", but it failed to remove the brackets from the text above: when I searched for a bracket, I still got a result.
RE: Solr - Remove specific punctuation marks
On Mon 24-Sep-2012 15:09, Daisy wrote: "There is a problem that the regex couldn't be understood in the solr schema and it gives 500 - code error."
The config is XML. Try encoding the ampersand as &amp;.
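To make the escaping concrete, here is a minimal sketch of the filter element with the ampersands encoded. It assumes the intended pattern relies on Java's && character-class intersection (punctuation, minus the full stop and the colon); the exact pattern is an illustration rather than a tested configuration.

```xml
<!-- Sketch only: a regex "&&" has to be written as "&amp;&amp;" inside an XML attribute value. -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="[\p{Punct}&amp;&amp;[^.:]]"
        replacement=""
        replace="all"/>
```

Once the XML parser has decoded the attribute, the factory should receive the regex [\p{Punct}&&[^.:]]. A raw & in schema.xml makes the file malformed XML, which is consistent with the 500 error described above.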
RE: Solr - Remove specific punctuation marks
I tried &amp; and it solved the 500 error. But searches can still find punctuation marks, even though the parsed query does not contain the punctuation mark:
<str name="rawquerystring">{</str>
<str name="querystring">{</str>
<str name="parsedquery">text:</str>
<str name="parsedquery_toString">text:</str>
numFound still reports one result:
<result name="response" numFound="1" start="0">
and the highlighting shows the punctuation mark: <em>{</em>
The steps I followed:
1. Edit the schema
2. Restart the server
3. Delete the file
4. Index the file
Re: Solr - Remove specific punctuation marks
1. Which query parser are you using?
2. I see the following comment in the Java 6 doc for regex \p{Punct}: "POSIX character classes (US-ASCII only)", so if any of the punctuation is some higher Unicode character code, it won't be matched/removed.
3. It seems very odd that the parsed query has empty terms; normally the query parsers will ignore terms that analyze to zero tokens. Maybe your { is not an ASCII left brace code and is (apparently) unprintable in the parsed query. Or maybe there is some encoding problem in the analyzer.
-- Jack Krupansky
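Point 2 could be addressed along the following lines. This is a hedged sketch, not a tested configuration: it assumes that Arabic punctuation should be stripped too, and it swaps \p{Punct} for the Unicode general categories \p{P} (punctuation) and \p{S} (symbols), which Java regex supports. Including \p{S} is an assumption, made because \p{Punct} also covers ASCII symbols such as $ and +. \u06D4 is the ARABIC FULL STOP mentioned earlier in the thread.

```xml
<!-- Sketch only: Unicode-aware variant of the pattern.
     (\p{P} or \p{S}) minus ".", ":" and ARABIC FULL STOP (U+06D4). -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="[\p{P}\p{S}&amp;&amp;[^.:\u06D4]]"
        replacement=""
        replace="all"/>
```

With this variant, characters such as the Arabic comma (U+060C) and the Arabic question mark (U+061F) fall inside the matched set, whereas \p{Punct} alone only covers US-ASCII punctuation.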
Re: Solr - Remove specific punctuation marks
I tried it, and PatternReplaceFilterFactory (PRFF) is indeed generating an empty token. I don't know how Lucene will index or query an empty term, or rather, what it should do. In any case, it is best to avoid them. You should be using a charFilter to filter raw characters before tokenizing. So, try: <charFilter class="solr.PatternReplaceCharFilterFactory"/> It has the same pattern and replacement attributes. -- Jack Krupansky
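Put together with the field type from the start of the thread, the suggestion might look like the sketch below. The pattern is an assumption (ASCII punctuation except the full stop and the colon, with && encoded for XML); only the class names and the pattern and replacement attributes come from the thread.

```xml
<!-- Sketch only: strip punctuation from the raw text before it reaches the tokenizer. -->
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- charFilter elements come first and run on the raw character stream -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[\p{Punct}&amp;&amp;[^.:]]"
                replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Because the characters are removed before tokenization, a run that consisted only of punctuation should simply disappear between two whitespace characters instead of being emitted as a token that is later rewritten to an empty string.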
Re: Solr - Remove specific punctuation marks
How can I find out which query parser I am using? Here is the relevant part of my schema:
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="(\()" replacement="" replace="all"/>
  </analyzer>
</fieldType>
<field name="text" type="text_ar" indexed="true" stored="true" termVectors="true" multiValued="true"/>
As shown, even when I only try to remove "(", the parsed query and numFound behave the same way.
Re: Solr - Remove specific punctuation marks
Thanks. It finally works using: <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\()" replacement="" replace="all"/> I wonder what the reason for that is, and what the difference is between the filter and the charFilter?
Re: Solr - Remove specific punctuation marks
When I do things like this and want to avoid empty tokens, even though earlier analysis might produce some, I just put one of these at the end of my analysis chain:
<!-- get rid of empty string tokens. max is required, although we don't really care. -->
<filter class="solr.LengthFilterFactory" min="1" max="..."/>
A charFilter to filter raw characters can certainly still result in an empty token, if an initial token was composed solely of chars you wanted to filter out! In that case you probably want the token to be deleted entirely, not left there as an empty token. The length filter above is one way to do that, although it unfortunately requires specifying a 'max' even though I didn't actually want to filter on the high end. Oh well.
Re: Solr - Remove specific punctuation marks
I've had problems with empty tokens. You can remove those with this as a step in the analyzer chain: <filter class="solr.LengthFilterFactory" min="1" max="1024"/> wunder -- Walter Underwood wun...@wunderwood.org
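As a sketch of where the length filter sits, the chain below keeps the token-level PatternReplaceFilterFactory and drops any token it has emptied. The pattern is an assumption carried over for illustration; min="1" and max="1024" are the values given in the message above.

```xml
<!-- Sketch only: token-level replacement followed by removal of empty tokens. -->
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- may turn a token such as "{" into an empty string -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="[\p{Punct}&amp;&amp;[^.:]]"
            replacement=""
            replace="all"/>
    <!-- drops tokens shorter than 1 or longer than 1024 characters -->
    <filter class="solr.LengthFilterFactory" min="1" max="1024"/>
  </analyzer>
</fieldType>
```

The order matters: the length filter has to come after the replacement filter, since token filters run in the order they are listed.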
Re: Solr - Remove specific punctuation marks
Using solr.LengthFilterFactory worked well and also solved the problem with PatternReplaceFilter. So now I have two solutions. Thanks, all, for helping me. One thing I would like to know: what is the difference between PatternReplaceFilter and PatternReplaceCharFilter?
Re: Solr - Remove specific punctuation marks
On 9/24/2012 11:37 AM, Daisy wrote: One thing I would like to know: what is the difference between PatternReplaceFilter and PatternReplaceCharFilter?
The CharFilter version gets applied before anything else, including the Tokenizer. The Filter version gets applied in the order specified in the schema file. I would imagine that if you are allowed to specify multiple CharFilter entries (which I have never tested), they would be applied in the order they occur, all of them before the Tokenizer. Thanks, Shawn
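The ordering described above can be sketched as an annotated analyzer. It assumes that several charFilter elements are accepted and applied in document order, which the message above notes was not tested; the two patterns and the field type name text_ar_ordered are hypothetical, chosen only to make the stages visible.

```xml
<!-- Sketch only: the three stages of an analyzer, in the order they are applied. -->
<fieldType name="text_ar_ordered" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Stage 1: char filters see the whole raw text, before any tokens exist -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[()]" replacement=""/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[{}]" replacement=""/>
    <!-- Stage 2: the tokenizer splits the filtered text into tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Stage 3: token filters see one token at a time, in the order listed -->
    <filter class="solr.LengthFilterFactory" min="1" max="1024"/>
  </analyzer>
</fieldType>
```

This also suggests why the two approaches behaved differently in the thread: a char filter removes "(" before the tokenizer runs, while a token filter can only rewrite a token that already exists, which may leave it empty.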