Re: Solr - Remove specific punctuation marks

Walter Underwood Mon, 24 Sep 2012 10:32:48 -0700

I've had problems with empty tokens. You can remove them by adding this filter
as a step in the analyzer chain:


        <filter class="solr.LengthFilterFactory" min="1" max="1024"/>
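
For context, here is a rough sketch of how that filter might sit in a complete
analyzer chain, together with the charFilter Jack suggests in the quoted
message below. This is not Daisy's actual schema: the field type name
"text_nopunct", the whitespace tokenizer, and the pattern are all assumptions
chosen for illustration. The pattern uses the Unicode categories \p{P} and
\p{S} rather than \p{Punct}, which the Java 6 documentation quoted below
describes as US-ASCII only.

```xml
<!-- Hypothetical field type: the name, tokenizer, and pattern are assumptions. -->
<fieldType name="text_nopunct" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Strip punctuation and symbol characters from the raw text before
         tokenizing. \p{P} and \p{S} are Unicode general categories. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[\p{P}\p{S}]" replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Safety net: drop any zero-length tokens an earlier step might produce. -->
    <filter class="solr.LengthFilterFactory" min="1" max="1024"/>
  </analyzer>
</fieldType>
```

With replacement="" the punctuation is simply deleted, so "foo,bar" becomes the
single token "foobar"; use replacement=" " if punctuation should also separate
words. Documents indexed before the schema change keep their old terms until
they are reindexed.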

wunder

On Sep 24, 2012, at 10:07 AM, Jack Krupansky wrote:

> I tried it, and PRFF (PatternReplaceFilterFactory) is indeed generating an
> empty token. I don't know how Lucene will index or query an empty term, or
> rather what it "should" do. In any case, it is best to avoid them.
> 
> You should be using a "charFilter" to simply filter raw characters before 
> tokenizing. So, try:
> 
> <charFilter class="solr.PatternReplaceCharFilterFactory"/>
> 
> It takes the same "pattern" and "replacement" attributes as
> PatternReplaceFilterFactory.
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: Jack Krupansky
> Sent: Monday, September 24, 2012 12:43 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr - Remove specific punctuation marks
> 
> 1. Which query parser are you using?
> 2. I see the following comment in the Java 6 doc for regex "\p{Punct}":
> "POSIX character classes (US-ASCII only)", so if any of the punctuation is
> some higher Unicode character code, it won't be matched/removed.
> 3. It seems very odd that the parsed query has empty terms - normally the
> query parsers will ignore terms that analyze to zero tokens. Maybe your "{"
> is not an ASCII left brace code and is (apparently) unprintable in the
> parsed query. Or, maybe there is some encoding problem in the analyzer.
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: Daisy
> Sent: Monday, September 24, 2012 9:26 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr - Remove specific punctuation marks
> 
> I tried &amp; and it fixed the 500 error. But searches still find
> punctuation marks, even though the parsed query doesn't contain the
> punctuation mark:
> 
> <str name="rawquerystring">"{"</str>
> <str name="querystring">"{"</str>
> <str name="parsedquery">text:</str>
> <str name="parsedquery_toString">text:</str>
> 
> but numFound is still 1:
> 
> <result name="response" numFound="1" start="0">
> 
> and the highlighting shows the punctuation mark as the match:
>
> <em>{</em>
>
> The steps I followed:
> 1. Edited the schema
> 2. Restarted the server
> 3. Deleted the file
> 4. Indexed the file
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-Remove-specific-punctuation-marks-tp4009795p4009835.html
> Sent from the Solr - User mailing list archive at Nabble.com. 

--
Walter Underwood
wun...@wunderwood.org
