[
https://issues.apache.org/jira/browse/SOLR-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511306
]
Mike Klaas commented on SOLR-293:
---------------------------------
> Would it be useful to be able to configure this separately for words and
> numbers?
I think it would, but I wasn't sure. Trivial to implement in either case.
> Is there anything that can be done along the same lines, when not catenating
> for the query analyzer, so "foo-bar" will still become "foo bar", but "A9"
> would stay as "A9"?
There are a couple ways to approach this (though I'm not exactly sure what your
question is):
- instead of a minimum part length, only split tokens whose total length
exceeds some value N. With N=3, this would let "HiFi/hi-fi" -> "hi fi" but
"hi8" -> "hi8". This makes the setting dependent on separator characters,
since they count toward the token length.
- ensure character inclusion. If any letter/number character was not included
in any generated subpart, ensure that a larger containing token is generated.
"high-figh-888" -> "high figh 888" (and not "highfigh888")
"hi-fi-8" -> "hifi8"
- approach the delimiter question differently. Currently, parts are delimited
on case change, alpha->num (and v.v.), and delimiter chars. The last is much,
much stronger as a lexical delimiter, and it would be nice to recognize the
difference between "java5", "mp3", "4x4" and "99-bottle", "20-cent-piece", etc.
Save for the first, I can't think of easy, efficient implementations. Perhaps
WDF shouldn't get too sophisticated.
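
As a rough illustration of the first option, here is a minimal Python sketch.
It is not the WDF code: split_parts() is a simplified stand-in for the
splitting rules (delimiter chars, case change, alpha<->num), and analyze()
and max_unsplit_length are hypothetical names. It assumes the option means
"only split tokens longer than N", which is what the N=3 examples imply, and
it lowercases output only to match those examples.

```python
import re

# Simplified stand-in for WordDelimiterFilter's splitting rules: break on
# delimiter characters, lower->upper case changes, and letter<->digit changes.
_PART = re.compile(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_parts(token):
    """Return the letter/digit runs of a token, dropping delimiter chars."""
    return _PART.findall(token)


def analyze(token, max_unsplit_length=3):
    """Split a token only if it is longer than max_unsplit_length.

    max_unsplit_length is a hypothetical setting (the "N" above). Delimiter
    characters count toward the length, so the result depends on them.
    """
    if len(token) <= max_unsplit_length:
        return [token.lower()]
    return [part.lower() for part in split_parts(token)]


if __name__ == "__main__":
    for token in ("HiFi", "hi-fi", "hi8", "hi-8"):
        print(token, "->", " ".join(analyze(token)))
```

Note that "hi-8" (length 4) is split into "hi 8" while "hi8" (length 3) is
left alone, which is the separator dependence mentioned above.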
> Add "minPartLength" to WordDelimiterFilter
> ------------------------------------------
>
> Key: SOLR-293
> URL: https://issues.apache.org/jira/browse/SOLR-293
> Project: Solr
> Issue Type: New Feature
> Components: update
> Affects Versions: 1.3
> Reporter: Mike Klaas
> Assignee: Mike Klaas
> Priority: Minor
> Fix For: 1.3
>
>
> WDF is handy but over-tokenizes when faced with short word parts:
> A9
> R2D2
> mp3
> This creates one- or two-character tokens which are extremely slow to query
> as the doc freq is so high (this is contributing to a significant portion of
> our slowest queries).
> This patch adds a "minPartLength" option that disables generation of parts
> below a certain length. It is recommended to use it with catenateAll, so as
> to not lose tokens.
> I'll add factory options and tests if we decide to include this (and are
> happy with the parameter name).
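
The interaction between minPartLength and catenateAll described above can be
sketched in Python as follows. This is an illustration only, not the patch:
word_delimiter() and its parameters are hypothetical names, the splitting
regex is a simplification of the real rules, and treating "below a certain
length" as "shorter than min_part_length" is an assumption.

```python
import re

# Simplified splitting: delimiter chars, lower->upper case changes, and
# letter<->digit changes all start a new part.
_PART = re.compile(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def word_delimiter(token, min_part_length=1, catenate_all=False):
    """Return the tokens generated for one input token.

    Parts shorter than min_part_length are not emitted. With catenate_all,
    the concatenation of all parts is emitted too, so the input is still
    represented when every part is too short.
    """
    parts = _PART.findall(token)
    tokens = [p for p in parts if len(p) >= min_part_length]
    if catenate_all and len(parts) > 1:
        joined = "".join(parts)
        if joined not in tokens:
            tokens.append(joined)
    return tokens


if __name__ == "__main__":
    print(word_delimiter("mp3"))                      # short parts kept
    print(word_delimiter("mp3", min_part_length=3))   # every part dropped
    print(word_delimiter("mp3", 3, catenate_all=True))
    print(word_delimiter("foo-bar", 3, catenate_all=True))
```

Without catenate_all, a token such as "mp3" produces nothing at all once
min_part_length is 3, which is why the two options are recommended together.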