Re: Tokenizing problem with numbers in query
Hi,

> Did you re-start tomcat and re-index your collection?

Yes.

> Do you want to search inside alphanumeric strings? Or are you interested only in prefix queries? Can you give us more examples, like target documents and queries?

Searching inside would be required, yes. If the above example worked, I would already be glad.

Bernd
Re: Tokenizing problem with numbers in query
Thanks to both of you for the quick answers.

analysis.jsp shows that the WordDelimiterFilterFactory is performing the split. I have been experimenting with the delimiter settings for the last two days but am still unable to obtain the desired result. I tried removing solr.WordDelimiterFilterFactory entirely from both query and text, which effectively crippled the search: I got nearly no results for anything. Removing it only from query also would not show the target document.

The target document looks like this:

bla /asdf5qwertz500ddd

Searching for /asdf5qwertz (also with a trailing wildcard, with or without the leading slash) won't show the document. It also won't get highlighted in analysis.jsp.

I tried setting splitOnNumerics to 0 (no change) as well as changing generateNumberParts to 0; the query is still being split at the number.

Any suggestions?

Bernd

On Sun, Jan 3, 2010 at 6:27 PM, Erick Erickson erickerick...@gmail.com wrote:

> This is an *extremely* useful page for figuring out what various tokenizers/filters are doing. The javadocs for the classes referenced can also provide some additional details.
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
> Erick
>
> On Sun, Jan 3, 2010 at 11:26 AM, Bernd Brod bernd.b...@gmail.com wrote:
>
>> Hello,
>>
>> when searching for a string: asdf5qwerty
>> solr will tokenize it to: asdf, 5, qwerty
>> and display documents matching any of these strings. How can I stop this behaviour and make it just search for plain asdf5qwerty?
>>
>> Thanks in advance.
>>
>> Bernd
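For readers following the thread, here is a rough Python sketch of the splitting behaviour described above. It is not Solr code and only imitates one aspect of a word-delimiter style filter: cutting a token at non-alphanumeric characters and, optionally, at letter/digit boundaries. The function name `split_word_parts` and its `split_on_numerics` flag are made up for this illustration; the real filter has more options and may behave differently in detail. The last lines show one plausible reason why removing the filter from the query side only does not help: the query term is then a single token, while the index still holds the split parts.

```python
import re


def split_word_parts(token, split_on_numerics=True):
    """Rough imitation of a word-delimiter style split (illustration only)."""
    # In this sketch, any non-alphanumeric character acts as a delimiter.
    chunks = [c for c in re.split(r"[^A-Za-z0-9]+", token) if c]
    if not split_on_numerics:
        return chunks
    parts = []
    for chunk in chunks:
        # Cut each chunk at every letter/digit boundary.
        parts.extend(re.findall(r"[A-Za-z]+|[0-9]+", chunk))
    return parts


print(split_word_parts("asdf5qwerty"))
# ['asdf', '5', 'qwerty']
print(split_word_parts("asdf5qwerty", split_on_numerics=False))
# ['asdf5qwerty']

# Index side still splits, query side does not: the terms no longer line up.
indexed = set(split_word_parts("/asdf5qwertz500ddd"))
query = split_word_parts("/asdf5qwertz", split_on_numerics=False)
print(sorted(indexed))
# ['5', '500', 'asdf', 'ddd', 'qwertz']
print(all(term in indexed for term in query))
# False
```

Under these assumptions, index-time and query-time analysis have to produce comparable terms, and documents indexed before a configuration change keep their old terms until they are re-indexed.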
Re: Tokenizing problem with numbers in query
Hi,

On Tue, Jan 5, 2010 at 5:17 PM, Erick Erickson erickerick...@gmail.com wrote:

> We need to back up, this is looking like an XY problem. That is, you're asking for specifics when what would probably be more helpful is for you to describe *what* the problem you're trying to solve is rather than *how* to make a specific behavior happen. Although re-reading your original e-mail does give a clue <G>
>
> If, for instance, you really really want the string indexed and searched literally (if, for instance, it's a part number), you want to use something like WhitespaceTokenizerFactory, perhaps lowercasing too, rather than fiddle around with KeywordTokenizerFactory.
>
> If you want some other behavior, please explain it in more detail <G>...

I am indexing files that also include traffic captures (so there can be pretty much anything inside). When looking for a long alphanumeric string, I would have expected fewer results than when searching for a short one. But because of all the tokenizing it returns more (useless) results. This is very disappointing because I could find these documents easily with grep.

What's even more disappointing: disabling the WordDelimiterFilterFactory (for query and/or text) just results in 0 hits on my document. I'm not quite sure what to do.

Ideally I would like to be able to search for strings such as a1a1a1a1a1a1a1 without matching a single a and/or 1.

Bernd
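To make the difference between the behaviours discussed here concrete, the following Python sketch tokenizes on whitespace only and lowercases, roughly in the spirit of the WhitespaceTokenizerFactory suggestion, and then compares exact, prefix, and grep-like substring matching. It is a toy model, not Solr: the functions `whitespace_tokens` and `search`, the `mode` parameter, and the "noise" and "long" documents are invented for illustration. Only the "capture" text comes from the thread.

```python
def whitespace_tokens(text):
    # Split on whitespace only and lowercase; no splitting at digits or slashes.
    return [t.lower() for t in text.split()]


def search(docs, query, mode="exact"):
    """Return ids of docs with at least one token matching the query."""
    q = query.lower()
    hits = []
    for doc_id, text in docs.items():
        tokens = whitespace_tokens(text)
        if mode == "exact":
            matched = q in tokens
        elif mode == "prefix":
            matched = any(t.startswith(q) for t in tokens)
        elif mode == "substring":
            matched = any(q in t for t in tokens)
        else:
            raise ValueError("unknown mode: " + mode)
        if matched:
            hits.append(doc_id)
    return hits


docs = {
    "capture": "bla /asdf5qwertz500ddd",   # target document from the thread
    "noise": "a 1 a1 5 asdf",              # hypothetical short tokens
    "long": "id a1a1a1a1a1a1a1 end",       # hypothetical long token
}

print(search(docs, "a1a1a1a1a1a1a1"))                  # ['long']
print(search(docs, "/asdf5qwertz", mode="prefix"))     # ['capture']
print(search(docs, "asdf5qwertz", mode="substring"))   # ['capture']
print(search(docs, "asdf5qwertz"))                     # []
```

In this model the long string no longer matches a lone a or 1, but a whole-token match also misses /asdf5qwertz500ddd unless the query is treated as a prefix or substring. How best to get the substring behaviour in Solr itself is not settled in this thread.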
Tokenizing problem with numbers in query
Hello,

when searching for a string: asdf5qwerty
solr will tokenize it to: asdf, 5, qwerty
and display documents matching any of these strings. How can I stop this behaviour and make it just search for plain asdf5qwerty?

Thanks in advance.

Bernd