[ http://issues.apache.org/jira/browse/SOLR-41?page=comments#action_12425603 ]

Boris Vitez commented on SOLR-41:
---------------------------------

As Yonik suggested, I uploaded only the latest .diff file. Please ignore the
.java attachments.
The filter now works standalone (without WordDelimiterFilter). I couldn't use
the suggested setTermText on the existing token because I needed to set the
correct start and end offsets, so the filter creates a new token instead. The
newly created token has the same position increment as the first token that
contains the hyphen.
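
To illustrate the offset and position-increment handling described above, here
is a minimal sketch. It is not the code from the attached patch and it does not
use Lucene's Token API: the Tok class, the joinHyphenated method and the sample
offsets are hypothetical stand-ins, and the input is assumed to come from a
whitespace-style tokenizer that leaves the trailing hyphen on the first
fragment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only: a stand-in for the idea, not the patch itself. */
class HyphenJoinSketch {

    /** Hypothetical token type (NOT Lucene's Token class). */
    static final class Tok {
        final String text;
        final int start;   // start offset in the original text
        final int end;     // end offset (exclusive) in the original text
        final int posInc;  // position increment relative to the previous token

        Tok(String text, int start, int end, int posInc) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return text + " [" + start + "," + end + ") posInc=" + posInc;
        }
    }

    /** Joins each token ending in '-' with the token(s) that follow it. */
    static List<Tok> joinHyphenated(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        StringBuilder pending = null;    // text of the word being assembled
        int start = 0, end = 0, posInc = 0;

        for (Tok t : in) {
            // A lone "-" is not treated as a hyphenated fragment.
            boolean endsWithHyphen = t.text.length() > 1 && t.text.endsWith("-");
            String piece = endsWithHyphen
                    ? t.text.substring(0, t.text.length() - 1)
                    : t.text;

            if (pending == null && !endsWithHyphen) {
                out.add(t);              // ordinary token, pass through unchanged
                continue;
            }
            if (pending == null) {
                // First fragment: remember its start offset and position increment.
                pending = new StringBuilder();
                start = t.start;
                posInc = t.posInc;
            }
            pending.append(piece);
            end = t.end;                 // end offset follows the latest fragment
            if (!endsWithHyphen) {
                // Word is complete: emit a NEW token spanning all fragments.
                out.add(new Tok(pending.toString(), start, end, posInc));
                pending = null;
            }
        }
        if (pending != null) {
            // Assumption: a dangling fragment at end of input is emitted
            // without its hyphen.
            out.add(new Tok(pending.toString(), start, end, posInc));
        }
        return out;
    }

    public static void main(String[] args) {
        // Tokens for the text "ecologi-\ncal develop"
        List<Tok> tokens = Arrays.asList(
                new Tok("ecologi-", 0, 8, 1),
                new Tok("cal", 9, 12, 1),
                new Tok("develop", 13, 20, 1));
        for (Tok t : joinHyphenated(tokens)) {
            System.out.println(t);
        }
    }
}
```

Building a fresh token, rather than rewriting the text of the first fragment,
is what lets the joined word carry the start offset of the first fragment and
the end offset of the last one.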

> PATCH: HyphenatedWordsFilter, Factory and test
> ----------------------------------------------
>
>                 Key: SOLR-41
>                 URL: http://issues.apache.org/jira/browse/SOLR-41
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Boris Vitez
>            Priority: Minor
>         Attachments: HyphenatedWordsFilter.java, hyphenatedwordsfilter.patch, 
> hyphenatedwordsfilter.patch, HyphenatedWordsFilterFactory.java, 
> TestHyphenatedWordsFilter.java
>
>
> When plain text is extracted from documents, many words are often
> hyphenated and broken across two lines. This is common in documents that
> use narrow text columns, such as newsletters.
> To improve search efficiency, this filter joins hyphenated words that
> were broken across two lines.
> This filter has to be used together with the WordDelimiterFilter having 
> catenateWords=1.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
