    [ http://issues.apache.org/jira/browse/SOLR-41?page=comments#action_12425603 ]

Boris Vitez commented on SOLR-41:
---------------------------------
As Yonik suggested, I uploaded only the latest .diff file. Please ignore the .java attachments.
The filter now works standalone (without WordDelimiterFilter).
I couldn't use the suggested setTermText on the existing token because I needed to set the correct start and end offsets.
The newly created token has the same position increment as the first token that contains the hyphen.

> PATCH: HyphenatedWordsFilter, Factory and test
> ----------------------------------------------
>
>                 Key: SOLR-41
>                 URL: http://issues.apache.org/jira/browse/SOLR-41
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Boris Vitez
>            Priority: Minor
>         Attachments: HyphenatedWordsFilter.java, hyphenatedwordsfilter.patch, hyphenatedwordsfilter.patch, HyphenatedWordsFilterFactory.java, TestHyphenatedWordsFilter.java
>
>
> When the plain text is extracted from documents, we will often have many words hyphenated and broken into two lines. This is often the case with documents where narrow text columns are used, such as newsletters.
> In order to increase searching efficiency, this filter unites hyphenated words broken in two lines.
> This filter has to be used together with the WordDelimiterFilter having catenateWords=1.
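
For readers without the patch at hand, the token handling described in the comment (join a token ending in a hyphen with the token that follows, create a new token whose offsets span both parts, and keep the position increment of the first part) might look roughly like the sketch below. This is not the code from the attached patch. The class HyphenJoinSketch, the Tok type, and the whitespace tokenizer are hypothetical stand-ins so the example runs without Lucene or Solr on the classpath, and passing a trailing hyphen through unchanged is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone sketch; not the classes from the attached patch. */
public class HyphenJoinSketch {

    /** Minimal stand-in for an analysis token (an assumption, not Lucene's Token). */
    public static final class Tok {
        public final String text;
        public final int start;   // start offset in the original text
        public final int end;     // end offset (exclusive)
        public final int posInc;  // position increment relative to the previous token

        public Tok(String text, int start, int end, int posInc) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.posInc = posInc;
        }
    }

    /** Splits on whitespace and records offsets; every token gets position increment 1. */
    public static List<Tok> tokenize(String s) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            while (i < s.length() && Character.isWhitespace(s.charAt(i))) i++;
            int start = i;
            while (i < s.length() && !Character.isWhitespace(s.charAt(i))) i++;
            if (i > start) out.add(new Tok(s.substring(start, i), start, i, 1));
        }
        return out;
    }

    /** Joins each token that ends in '-' with the token that follows it. */
    public static List<Tok> joinHyphenated(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        Tok pending = null; // first part of a broken word, hyphen already stripped
        for (Tok t : in) {
            Tok cur = t;
            if (pending != null) {
                // New token: text of both parts, offsets spanning both parts,
                // position increment taken from the first (hyphenated) part.
                cur = new Tok(pending.text + t.text, pending.start, t.end, pending.posInc);
                pending = null;
            }
            if (cur.text.length() > 1 && cur.text.endsWith("-")) {
                // Strip the hyphen and wait for the continuation on the next line.
                String stem = cur.text.substring(0, cur.text.length() - 1);
                pending = new Tok(stem, cur.start, cur.end, cur.posInc);
            } else {
                out.add(cur);
            }
        }
        if (pending != null) {
            // Assumption: a trailing hyphen with no continuation is emitted unchanged.
            out.add(new Tok(pending.text + "-", pending.start, pending.end, pending.posInc));
        }
        return out;
    }

    public static void main(String[] args) {
        // "eco-" at the end of one line, "logical" at the start of the next.
        for (Tok t : joinHyphenated(tokenize("eco-\nlogical growth"))) {
            System.out.println(t.text + " [" + t.start + "," + t.end + ") +" + t.posInc);
        }
    }
}
```

Creating a fresh token rather than changing the text of an existing one is what lets the joined word carry the start offset of the first part and the end offset of the second.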
