Solr - Remove specific punctuation marks

2012-09-24 Thread Daisy
Hi;

I am working with apache-solr-3.6.0 on a Windows machine. I would like to
remove all punctuation marks before indexing, except the colon and the
full stop.

I tried:

<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="[\p{Punct}&&[^\.^\:]]" replacement="" replace="all"/>
  </analyzer>
</fieldType>
But it didn't work. Any ideas?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Remove-specific-punctuation-marks-tp4009795.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Solr - Remove specific punctuation marks

2012-09-24 Thread Steven A Rowe
Hi Daisy,

I can't see anything wrong with the regex or the XML syntax.

One possibility: if it's Arabic you're matching against, you may want to add 
ARABIC FULL STOP U+06D4 to the set you subtract from \p{Punct}.

If you give an example of your input and your expected output, I might be able 
to help more.

Steve
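
A quick way to check the regex outside Solr is to run it through java.util.regex directly. The sketch below is only an illustration: the class name and sample text are made up, and it assumes the posted pattern contained the `&&` intersection operator. It also shows one subtlety of the pattern as posted: inside a character class only a leading `^` negates, so the second `^` in `[^\.^\:]` is a literal caret and `^` is kept along with `.` and `:`.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Solr.
final class PunctExceptColonAndStop {
    // Intersection: ASCII punctuation AND NOT ('.' or ':').
    static final Pattern STRICT = Pattern.compile("[\\p{Punct}&&[^.:]]");

    // The pattern as posted: the second ^ is a literal caret,
    // so '^' survives in addition to '.' and ':'.
    static final Pattern AS_POSTED = Pattern.compile("[\\p{Punct}&&[^\\.^\\:]]");

    // Removes every character matched by the given pattern.
    static String strip(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "a:b (c), d^e!";
        System.out.println(strip(STRICT, sample));
        System.out.println(strip(AS_POSTED, sample));
    }
}
```

Both patterns compile and behave as a set subtraction, which agrees with the observation that the regex itself is fine.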



RE: Solr - Remove specific punctuation marks

2012-09-24 Thread Daisy
Yes, I am trying to index an Arabic document. The problem is that the
regex isn't understood in the Solr schema, and it gives a 500 error
code.
Here is an example:

input:

هذا مثال: للتوضيح (مثال علي علامات الترقيم) انتهي.

I also tried the regex:  pattern="([\(\)\}\{\,[^.:\s+\S+]])"
but I failed to remove the brackets from the text above; when I searched for
a bracket, I still got a result.





RE: Solr - Remove specific punctuation marks

2012-09-24 Thread Markus Jelsma


 
 
The config is XML. Try encoding each ampersand as &amp;
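
To see why the raw pattern triggers an error, the attribute can be run through any XML parser. The following is a minimal sketch using the JDK's built-in DOM parser; the class and method names are hypothetical, and the `<filter>` element is reduced to the one attribute that matters. A raw `&&` is not well-formed XML, while `&amp;&amp;` parses and reaches the regex engine as `&&`.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; not part of Solr.
final class AmpersandInSchemaDemo {
    // Parses a one-element XML snippet and returns its "pattern" attribute.
    static String patternAttribute(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getAttribute("pattern");
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML: " + e.getMessage(), e);
        }
    }

    // True if the snippet parses, false if the parser rejects it.
    static boolean isWellFormed(String xml) {
        try {
            patternAttribute(xml);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String raw = "<filter pattern=\"[\\p{Punct}&&[^.:]]\"/>";
        String escaped = "<filter pattern=\"[\\p{Punct}&amp;&amp;[^.:]]\"/>";
        System.out.println(isWellFormed(raw));          // the parser rejects the bare &&
        System.out.println(patternAttribute(escaped));  // the regex as the analyzer would see it
    }
}
```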



RE: Solr - Remove specific punctuation marks

2012-09-24 Thread Daisy
I tried &amp; and it solved the 500 error code. But searches still find
punctuation marks, although the parsed query doesn't contain the
punctuation mark:

<str name="rawquerystring">{</str>
<str name="querystring">{</str>
<str name="parsedquery">text:</str>
<str name="parsedquery_toString">text:</str>

numFound is still 1:

<result name="response" numFound="1" start="0">

and the highlighting shows the punctuation mark:
<em>{</em>
The steps I did:
1- edit the schema
2- restart the server
3- delete the file
4- index the file






Re: Solr - Remove specific punctuation marks

2012-09-24 Thread Jack Krupansky

1. Which query parser are you using?
2. I see the following comment in the Java 6 doc for regex \p{Punct}:
"POSIX character classes (US-ASCII only)", so if any of the punctuation is
some higher Unicode character code, it won't be matched/removed.
3. It seems very odd that the parsed query has empty terms - normally the 
query parsers will ignore terms that analyze to zero tokens. Maybe your { 
is not an ASCII left brace code and is (apparently) unprintable in the 
parsed query. Or, maybe there is some encoding problem in the analyzer.


-- Jack Krupansky
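
The ASCII-only behavior of \p{Punct} is easy to confirm in plain Java. The sketch below is an illustration with made-up names; it compares the ASCII-only class with one possible Unicode-aware variant that unions \p{Punct} with \p{P} (the Unicode punctuation category) and also spares ARABIC FULL STOP U+06D4. Whether that wider set is the right one for a given corpus is an assumption to check against real data.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Solr.
final class UnicodePunctDemo {
    // \p{Punct} without extra flags covers US-ASCII punctuation only.
    static final Pattern ASCII_ONLY = Pattern.compile("[\\p{Punct}&&[^.:]]");

    // Union of ASCII punctuation and Unicode category P, minus '.', ':'
    // and ARABIC FULL STOP (U+06D4). Union binds tighter than &&.
    static final Pattern UNICODE_AWARE =
            Pattern.compile("[\\p{Punct}\\p{P}&&[^.:\\u06D4]]");

    // Removes every character matched by the given pattern.
    static String strip(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // U+060C ARABIC COMMA, U+061F ARABIC QUESTION MARK, U+06D4 ARABIC FULL STOP
        String sample = "a\u060C b\u061F (c)\u06D4";
        System.out.println(strip(ASCII_ONLY, sample));     // Arabic marks survive
        System.out.println(strip(UNICODE_AWARE, sample));  // only U+06D4 survives
    }
}
```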




Re: Solr - Remove specific punctuation marks

2012-09-24 Thread Jack Krupansky
I tried it and PRFF is indeed generating an empty token. I don't know how 
Lucene will index or query an empty term. I mean, what it should do. In 
any case, it is best to avoid them.


You should be using a charFilter to simply filter raw characters before 
tokenizing. So, try:


<charFilter class="solr.PatternReplaceCharFilterFactory"/>

It has the same pattern and replacement attributes.

-- Jack Krupansky
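
The empty token can be reproduced without Solr. This is a plain-Java simulation, not Lucene code: whitespace splitting stands in for WhitespaceTokenizerFactory and String.replaceAll stands in for the pattern-replace step, and all names are hypothetical. Replacing after tokenizing turns a token such as "{" into an empty string, which would be consistent with the numFound=1 reported earlier; replacing before tokenizing leaves nothing for the tokenizer to emit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation; none of this is Solr or Lucene API.
final class FilterVsCharFilterSketch {
    static final String PATTERN = "[\\p{Punct}&&[^.:]]";

    // Stand-in for a whitespace tokenizer: split on runs of whitespace.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("\\s+")) {
            if (!piece.isEmpty()) tokens.add(piece);
        }
        return tokens;
    }

    // Token-filter order: tokenize first, then rewrite each token.
    static List<String> tokenFilterStyle(String text) {
        List<String> out = new ArrayList<>();
        for (String token : whitespaceTokens(text)) {
            out.add(token.replaceAll(PATTERN, ""));  // "{" becomes ""
        }
        return out;
    }

    // Char-filter order: rewrite the raw text, then tokenize.
    static List<String> charFilterStyle(String text) {
        return whitespaceTokens(text.replaceAll(PATTERN, ""));
    }

    public static void main(String[] args) {
        System.out.println(tokenFilterStyle("a { b"));  // three tokens, the middle one empty
        System.out.println(charFilterStyle("a { b"));   // two tokens
    }
}
```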




Re: Solr - Remove specific punctuation marks

2012-09-24 Thread Daisy
How can I find out which query parser I am using?
Here is the part of my schema that I am using:

<fieldType name="text_ar" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="(\()"
            replacement="" replace="all"/>
  </analyzer>
</fieldType>

<field name="text" type="text_ar" indexed="true" stored="true"
       termVectors="true" multiValued="true"/>

As shown, even when I tried to remove only "(", the same thing happened
for the parsed query and for numFound.





Re: Solr - Remove specific punctuation marks

2012-09-24 Thread Daisy
Thanks. It finally works using

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\()"
            replacement="" replace="all"/>

I wonder what the reason for that is, and what the difference is between
the filter and the charFilter?





Re: Solr - Remove specific punctuation marks

2012-09-24 Thread Jonathan Rochkind
When I do things like this and want to avoid empty tokens, even though
previous analysis might produce some, I just throw one of these at the
end of my analysis chain:


<!-- get rid of empty string tokens. max is required, although
     we don't really care. -->
<filter class="solr.LengthFilterFactory" min="1" max=""/>

A charfilter to filter raw characters can certainly still result in an 
empty token, if an initial token was composed solely of chars you wanted 
to filter out!  In which case you probably want the token to be deleted 
entirely, not still there as an empty token. The above length filter is 
one way to do that, although unfortunately requires specifying a 'max' 
even though I didn't actually want to filter out on the high end, oh well.
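
What the length filter does to the token stream can be sketched in a few lines of plain Java. This is a simulation with hypothetical names, assuming both bounds are inclusive; it is not the Lucene implementation. With min=1, any empty token left behind by an earlier replace step is dropped, and a generous max leaves everything else alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of a length filter; not Lucene code.
final class LengthFilterSketch {
    // Keeps tokens whose length lies in [min, max], assuming inclusive bounds.
    static List<String> keepByLength(List<String> tokens, int min, int max) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            int length = token.length();
            if (length >= min && length <= max) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The empty string is what a pattern-replace token filter leaves behind for "{".
        List<String> tokens = Arrays.asList("a", "", "b");
        System.out.println(keepByLength(tokens, 1, 1024));
    }
}
```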





Re: Solr - Remove specific punctuation marks

2012-09-24 Thread Walter Underwood
I've had problems with empty tokens. You can remove those with this as a step 
in the analyzer chain.

<filter class="solr.LengthFilterFactory" min="1" max="1024"/>

wunder


--
Walter Underwood
wun...@wunderwood.org





Re: Solr - Remove specific punctuation marks

2012-09-24 Thread Daisy
Using solr.LengthFilterFactory was great and also solved the problem with
PatternReplaceFilter. So now I have two solutions. Thanks, all, for
helping me. One thing I would still like to know: what is the difference
between PatternReplaceFilter and PatternReplaceCharFilter?





Re: Solr - Remove specific punctuation marks

2012-09-24 Thread Shawn Heisey

On 9/24/2012 11:37 AM, Daisy wrote:

> One thing I would like to know what is the difference between
> PatternReplaceFilter and PatternReplaceCharFilter?


The CharFilter version gets applied before anything else, including the 
Tokenizer.  The Filter version gets applied in the order specified in 
the schema file.  I would imagine that if you are allowed to specify 
multiple CharFilter entries (which I have never tested), they would be 
applied in the order they occur, all of them before the Tokenizer.


Thanks,
Shawn
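
The ordering described above can be pictured as a three-stage pipeline. The following is a plain-Java sketch with hypothetical names, not Solr or Lucene API: char filters rewrite the raw text in the order listed, a whitespace split stands in for the tokenizer, and token filters then run in order. It assumes that multiple char filters are applied in the order they appear.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for an analyzer chain; not Solr or Lucene API.
final class AnalysisOrderSketch {

    static List<String> analyze(String text,
                                List<UnaryOperator<String>> charFilters,
                                List<UnaryOperator<List<String>>> tokenFilters) {
        // 1. Char filters rewrite the raw text, in the order listed.
        String raw = text;
        for (UnaryOperator<String> charFilter : charFilters) {
            raw = charFilter.apply(raw);
        }
        // 2. The tokenizer (whitespace splitting here) runs after all char filters.
        List<String> tokens = new ArrayList<>();
        for (String piece : raw.split("\\s+")) {
            if (!piece.isEmpty()) tokens.add(piece);
        }
        // 3. Token filters run last, in the order listed.
        for (UnaryOperator<List<String>> tokenFilter : tokenFilters) {
            tokens = tokenFilter.apply(tokens);
        }
        return tokens;
    }

    // Example token filter: lowercases every token.
    static List<String> lowercase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) out.add(token.toLowerCase(Locale.ROOT));
        return out;
    }

    // Two char filters (parentheses, then braces) followed by one token filter.
    static List<String> demo(String text) {
        List<UnaryOperator<String>> charFilters = new ArrayList<>();
        charFilters.add(s -> s.replaceAll("[()]", ""));
        charFilters.add(s -> s.replaceAll("[{}]", ""));
        List<UnaryOperator<List<String>>> tokenFilters = new ArrayList<>();
        tokenFilters.add(AnalysisOrderSketch::lowercase);
        return analyze(text, charFilters, tokenFilters);
    }

    public static void main(String[] args) {
        System.out.println(demo("(A) {b} {"));
    }
}
```

Because the lone "{" is removed before tokenizing, no empty token is produced for it in this sketch.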