Performance issues with facets and filter query exclusions
I was doing some performance testing on facet queries and I noticed something odd. Most queries tended to be under 500 ms, but every so often the query time jumped to something like 5000 ms.

q=*:*&fq={!tag=productBrandId}productBrandId:(156 1227)&facet.field={!ex=productBrandId}productBrandId&facet=true

I noticed that the drop in performance happened any time I had a filter query tag match up with a facet exclusion. If I had a query where the tags and exclusions differ, like the following...

q=*:*&fq={!tag=foo}foo:123&facet.field={!ex=bar}bar&facet=true

... then performance was fine. Any time a tag and ex parameter matched, I would see the drop in performance.

I worked around this by constructing individual queries for each facet I wanted, not using the filter query exclusion feature at all. Running multiple separate facet queries ended up being much faster than using filter query exclusion.

I wasn't able to find anything on the bug tracker about this. Does anyone have any hints about what could be causing this? I'm on Solr 4.4 and have not tested newer versions, so I don't know if this problem has been addressed.

- Hayden
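P.S. A rough sketch of the difference between the two request shapes, in Python (field and value literals are taken from the example above; the helper names are made up):

```python
from urllib.parse import urlencode

def tagged_facet_request(filters, facet_field):
    """One request using tag/ex filter exclusion (the slow case above)."""
    params = [("q", "*:*"), ("facet", "true")]
    for field, values in filters.items():
        params.append(("fq", "{!tag=%s}%s:(%s)" % (field, field, values)))
    params.append(("facet.field", "{!ex=%s}%s" % (facet_field, facet_field)))
    return urlencode(params)

def per_facet_request(filters, facet_field):
    """The workaround: a separate request per facet that simply drops the
    filter on the faceted field instead of excluding it via tag/ex."""
    params = [("q", "*:*"), ("facet", "true"), ("facet.field", facet_field)]
    for field, values in filters.items():
        if field != facet_field:
            params.append(("fq", "%s:(%s)" % (field, values)))
    return urlencode(params)
```

With several active filters you issue one `per_facet_request` per facet field, each omitting only its own filter, which reproduces the counts that tag/ex would have produced in a single request.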
Re: Performance issues with facets and filter query exclusions
That query is representative of some of the queries in my test, but I didn't notice any correlation between using the match-all-docs query and poor performance. Here's another example of a query that took longer than expected:

qt=en&q=dress green leather&fq=userId:(383)&fq={!tag=productRetailerId}productRetailerId:(83 644)&fq={!tag=productCanonicalColorId}productCanonicalColorId:(16 7 13)&facet.field={!ex=productRetailerId}productRetailerId&facet=true&facet.mincount=1&facet.limit=100

This query took over five seconds. Here I'm faceting on just one field, productRetailerId. For the actual search results, Solr has to intersect four queries: "dress green leather", userId:(...), productRetailerId:(...) and productCanonicalColorId:(...). For the facet, it has to compute an intersection over the same queries, excluding the productRetailerId:(...) query.

To your point about the match-all-docs query, there are plenty of examples which ran quickly with a match-all-docs query. I've put together a Google spreadsheet with some of my test results.

https://docs.google.com/spreadsheets/d/149k6_CM6JuGMbqhZIfiJetTxDxXcWdKeGU6FomjwO9Y/edit?usp=sharing

I ran another test with some simplified facet queries. In these examples, I only did one facet at a time, and never faceted on a field I was running a filter query on. These are examples of queries I would run to get the same functionality as filter query exclusion.

https://docs.google.com/spreadsheets/d/1xzS2sbb6btyvydD6Q5X8ecD82DE92Pls-DbK2nwdTvc/edit?usp=sharing

Most of these queries run in under 100 ms, and even the slowest tend to be under 500 ms. I can reproduce the functionality of the five-second query at the beginning of this email by running two of these simplified queries. There are examples in my first spreadsheet where a filter exclusion is happening and the query performs just fine.
However, it seems that all slow queries have a filter exclusion, and no queries without a filter exclusion have query times longer than a second.

For reference, all these tests were done on a non-optimized core with about 80 million records, with no indexing happening. Each of the spreadsheets represents performance on a warmed core; I warmed the core by running the test for about a minute before gathering this data. The spreadsheets are output from SolrMeter. I can post logs if that's easier to look at.

- Hayden

On Fri, Jul 18, 2014 at 11:48 AM, Yonik Seeley yo...@heliosearch.com wrote:

On Fri, Jul 18, 2014 at 2:10 PM, Hayden Muhl haydenm...@gmail.com wrote:

I was doing some performance testing on facet queries and I noticed something odd. Most queries tended to be under 500 ms, but every so often the query time jumped to something like 5000 ms.

q=*:*&fq={!tag=productBrandId}productBrandId:(156 1227)&facet.field={!ex=productBrandId}productBrandId&facet=true

I noticed that the drop in performance happened any time I had a filter query tag match up with a facet exclusion.

Is this an actual query that took a long time, or just an example? My guess is that q is actually much more expensive.

If a filter is excluded, the base DocSet for faceting must be re-computed. This involves intersecting all the DocSets for the other filters not excluded (which should all be cached) with the DocSet of the query (which won't be cached and will need to be generated). That last step can be expensive, depending on the query.

-Yonik
http://heliosearch.org - native code faceting, facet functions, sub-facets, off-heap data
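As an aside, the re-computation Yonik describes can be sketched with plain Python sets (a toy model of the idea, not Solr internals; real DocSets are compressed bitsets over Lucene doc IDs):

```python
def facet_base_docset(query_docs, cached_filter_docs, excluded_tag=None):
    """Recompute the base DocSet for faceting: intersect the main query's
    DocSet with every filter DocSet except the excluded one.

    With no exclusion, the fully-filtered result set can be reused as-is;
    with an exclusion, this intersection -- including materializing the
    uncached DocSet of the main query -- has to be redone."""
    docs = set(query_docs)  # the potentially expensive, uncached part
    for tag, filter_docs in cached_filter_docs.items():
        if tag != excluded_tag:
            docs &= filter_docs  # cheap: these DocSets come from the cache
    return docs
```

This is consistent with the observation in the thread: the slow step is not the exclusion itself but regenerating the query's DocSet, so a cheap q can still be slow when combined with an expensive main query.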
Re: Strategies for effective prefix queries?
A copy field does not address my problem, and this has nothing to do with stored fields. This is a query parsing problem, not an indexing problem.

Here's the use case. If someone has a username like bob-smith, I would like it to match the prefixes bo and sm. I tokenize the username into the tokens bob and smith. Everything is fine so far. If someone enters "bo sm" as a search string, I would like bob-smith to be one of the results. The query to do this is straightforward: username:bo* username:sm*.

Here's the problem. In order to construct that query, I have to tokenize the search string "bo sm" **on the client**. I don't want to reimplement tokenization on the client. Is there any way to give Solr the string "bo sm", have Solr do the tokenization, then treat each token as a prefix?

On Tue, Jul 15, 2014 at 4:55 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

So copyField it to another field and apply alternative processing there. Use eDismax to search both. No need to store the copied field, just index it.

Regards,
Alex

On 16/07/2014 2:46 am, Hayden Muhl haydenm...@gmail.com wrote:

Both fields? There is only one field here: username.

On Mon, Jul 14, 2014 at 6:17 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

Search against both fields (one split, one not split)? Keep original and tokenized form? I am doing something similar with class name autocompletes here: https://github.com/arafalov/Solr-Javadoc/blob/master/JavadocIndex/JavadocCollection/conf/schema.xml#L24

Regards,
Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On Tue, Jul 15, 2014 at 8:04 AM, Hayden Muhl haydenm...@gmail.com wrote:

I'm working on using Solr for autocompleting usernames. I'm running into a problem with wildcard queries (e.g. username:al*).
We are tokenizing usernames so that a username like solr-user will be tokenized into solr and user, and will match both the sol and use prefixes. The problem is when we get solr-u as a prefix: I'm having to split that up on the client side before I construct the query username:solr* username:u*. I'm basically using a regex as a poor man's tokenizer. Is there a better way to approach this? Is there a way to tell Solr to tokenize a string and use the parts as prefixes?

- Hayden
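P.S. For the curious, the client-side workaround I'm describing is roughly this (a hypothetical helper; the regex is a stand-in for whatever the server-side tokenizer actually does):

```python
import re

def prefix_query(field, user_input):
    """Poor man's client-side tokenizer: split on anything that's not a
    letter or digit, then turn each token into a prefix clause.

    The catch is that this split has to stay in sync with the server-side
    analyzer, which is exactly the duplication I'd like to avoid."""
    tokens = [t for t in re.split(r"[^0-9A-Za-z]+", user_input.lower()) if t]
    return " ".join("%s:%s*" % (field, t) for t in tokens)
```

So `prefix_query("username", "solr-u")` builds the `username:solr* username:u*` query from the example.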
Re: Strategies for effective prefix queries?
Thank you Jorge. I didn't know about that filter. It's just what I was looking for.

- Hayden

On Wed, Jul 16, 2014 at 4:35 PM, Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu wrote:

Perhaps what you're trying to do could be addressed by using the EdgeNGramFilterFactory filter? For query suggestions I'm using a very similar approach; this is an extract of the configuration I'm using:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="10" minGramSize="1"/>

Basically this allows you to get partial matches from any part of the string. Let's say the field gets this content at index time: "A brown fox". This document will be matched by the query "bro", for instance. My personal recommendation is to use this in a separate field that gets populated through a copyField; this way you could apply different boosts.

Greetings,

On Jul 16, 2014, at 2:00 PM, Hayden Muhl haydenm...@gmail.com wrote:

A copy field does not address my problem, and this has nothing to do with stored fields. This is a query parsing problem, not an indexing problem.

Here's the use case. If someone has a username like bob-smith, I would like it to match the prefixes bo and sm. I tokenize the username into the tokens bob and smith. Everything is fine so far. If someone enters "bo sm" as a search string, I would like bob-smith to be one of the results. The query to do this is straightforward: username:bo* username:sm*.

Here's the problem. In order to construct that query, I have to tokenize the search string "bo sm" **on the client**. I don't want to reimplement tokenization on the client. Is there any way to give Solr the string "bo sm", have Solr do the tokenization, then treat each token as a prefix?
On Tue, Jul 15, 2014 at 4:55 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

So copyField it to another field and apply alternative processing there. Use eDismax to search both. No need to store the copied field, just index it.

Regards,
Alex

On 16/07/2014 2:46 am, Hayden Muhl haydenm...@gmail.com wrote:

Both fields? There is only one field here: username.

On Mon, Jul 14, 2014 at 6:17 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

Search against both fields (one split, one not split)? Keep original and tokenized form? I am doing something similar with class name autocompletes here: https://github.com/arafalov/Solr-Javadoc/blob/master/JavadocIndex/JavadocCollection/conf/schema.xml#L24

Regards,
Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On Tue, Jul 15, 2014 at 8:04 AM, Hayden Muhl haydenm...@gmail.com wrote:

I'm working on using Solr for autocompleting usernames. I'm running into a problem with wildcard queries (e.g. username:al*). We are tokenizing usernames so that a username like solr-user will be tokenized into solr and user, and will match both the sol and use prefixes. The problem is when we get solr-u as a prefix: I'm having to split that up on the client side before I construct the query username:solr* username:u*. I'm basically using a regex as a poor man's tokenizer. Is there a better way to approach this? Is there a way to tell Solr to tokenize a string and use the parts as prefixes?

- Hayden
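To illustrate what the EdgeNGramFilterFactory in the configuration above emits at index time, here is a toy re-implementation of the idea (not the Lucene code; the defaults mirror the minGramSize/maxGramSize in Jorge's extract):

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Expand a token into its front-anchored n-grams, the way an edge
    n-gram filter does at index time: 'brown' -> b, br, bro, brow, brown."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]
```

Because every prefix of each token is indexed as its own term, a plain term query like "bro" matches without any wildcard at query time, which is what makes this approach attractive for autocomplete.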
Re: Strategies for effective prefix queries?
Both fields? There is only one field here: username.

On Mon, Jul 14, 2014 at 6:17 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

Search against both fields (one split, one not split)? Keep original and tokenized form? I am doing something similar with class name autocompletes here: https://github.com/arafalov/Solr-Javadoc/blob/master/JavadocIndex/JavadocCollection/conf/schema.xml#L24

Regards,
Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On Tue, Jul 15, 2014 at 8:04 AM, Hayden Muhl haydenm...@gmail.com wrote:

I'm working on using Solr for autocompleting usernames. I'm running into a problem with wildcard queries (e.g. username:al*). We are tokenizing usernames so that a username like solr-user will be tokenized into solr and user, and will match both the sol and use prefixes. The problem is when we get solr-u as a prefix: I'm having to split that up on the client side before I construct the query username:solr* username:u*. I'm basically using a regex as a poor man's tokenizer. Is there a better way to approach this? Is there a way to tell Solr to tokenize a string and use the parts as prefixes?

- Hayden
Strategies for effective prefix queries?
I'm working on using Solr for autocompleting usernames. I'm running into a problem with wildcard queries (e.g. username:al*). We are tokenizing usernames so that a username like solr-user will be tokenized into solr and user, and will match both the sol and use prefixes. The problem is when we get solr-u as a prefix: I'm having to split that up on the client side before I construct the query username:solr* username:u*. I'm basically using a regex as a poor man's tokenizer. Is there a better way to approach this? Is there a way to tell Solr to tokenize a string and use the parts as prefixes?

- Hayden
Wildcard searches and tokenization
I'm working on a username autocomplete feature, and am having some issues with the way we are tokenizing user names. We're using the StandardTokenizerFactory to tokenize user names, so foo-bar gets split into two tokens. We take input from the user and use it as a prefix to search on the user name. This means wildcard searches of fo* and ba* both return foo-bar, which is what we want.

We have a problem when someone types in foo-b as a prefix. I would like to split this into foo and b, then use each as a prefix in a wildcard search. Is there an easy way to tell Solr, "Tokenize this, then do a prefix search"? I've written at least one QParserPlugin, so that's an option. Hopefully there's an easier way I'm unaware of.

- Hayden
Re: java.lang.LinkageError when using custom filters in multiple cores
Upgraded to 4.4.0, and that seems to have fixed it. The transition was mostly painless once I realized that the interface to AbstractAnalysisFactory had changed between 4.2 and 4.3. Thanks.

- Hayden

On Sat, Sep 21, 2013 at 3:28 AM, Alexandre Rafalovitch arafa...@gmail.com wrote:

Did you try the latest Solr? There was a library loading bug with multiple cores. Not a perfect match to your description, but close enough.

Regards,
Alex

On 21 Sep 2013 02:28, Hayden Muhl haydenm...@gmail.com wrote:

I have two cores, favorite and user, running in the same Tomcat instance. In each of these cores I have identical field types text_en, text_de, text_fr, and text_ja. These fields use some custom token filters I've written. Everything was going smoothly when I only had the favorite core. When I added the user core, I started getting java.lang.LinkageErrors thrown when I start up Tomcat. The error always happens with one of the classes I've written, but it's unpredictable which class the classloader chokes on.

Here's the really strange part. I comment out the text_* fields in the user core and the errors go away (makes sense). I add text_en back in, no error (OK). I add text_fr back in, no error (OK). I add text_de back in, and I get the error (ah ha!). I comment text_de out again, and I still get the same error (wtf?). I also put a breakpoint at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:424), and when I load everything one at a time, I don't get any errors.

I'm running Tomcat 5.5.28, Java 1.6.0_39 and Solr 4.2.0, all within Eclipse 1.5.1 on a Mac. I have not tested this on a production-like system yet.

Here's an example stack trace. In this case it was one of my Japanese filters, but other times it will choke on my synonym filter or my compound word filter. The specific class it fails on doesn't seem to be relevant.
SEVERE: null:java.lang.LinkageError: loader (instance of org/apache/catalina/loader/WebappClassLoader): attempted duplicate class definition for name: com/shopstyle/solrx/KatakanaVuFilterFactory
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
    at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at org.apache.catalina.loader.WebappClassLoader.findClass(WebappClassLoader.java:904)
    at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1353)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
    at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:627)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:249)
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:424)
    at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:462)
    at org.apache.solr.util.plugin.AbstractPluginLoader.create(AbstractPluginLoader.java:89)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
    at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:392)
    at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:86)
    at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:373)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:121)
    at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:1018)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1051)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:634)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:680)

- Hayden
java.lang.LinkageError when using custom filters in multiple cores
I have two cores, favorite and user, running in the same Tomcat instance. In each of these cores I have identical field types text_en, text_de, text_fr, and text_ja. These fields use some custom token filters I've written. Everything was going smoothly when I only had the favorite core. When I added the user core, I started getting java.lang.LinkageErrors thrown when I start up Tomcat. The error always happens with one of the classes I've written, but it's unpredictable which class the classloader chokes on.

Here's the really strange part. I comment out the text_* fields in the user core and the errors go away (makes sense). I add text_en back in, no error (OK). I add text_fr back in, no error (OK). I add text_de back in, and I get the error (ah ha!). I comment text_de out again, and I still get the same error (wtf?). I also put a breakpoint at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:424), and when I load everything one at a time, I don't get any errors.

I'm running Tomcat 5.5.28, Java 1.6.0_39 and Solr 4.2.0, all within Eclipse 1.5.1 on a Mac. I have not tested this on a production-like system yet.

Here's an example stack trace. In this case it was one of my Japanese filters, but other times it will choke on my synonym filter or my compound word filter. The specific class it fails on doesn't seem to be relevant.
SEVERE: null:java.lang.LinkageError: loader (instance of org/apache/catalina/loader/WebappClassLoader): attempted duplicate class definition for name: com/shopstyle/solrx/KatakanaVuFilterFactory
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
    at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at org.apache.catalina.loader.WebappClassLoader.findClass(WebappClassLoader.java:904)
    at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1353)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
    at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:627)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:249)
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:424)
    at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:462)
    at org.apache.solr.util.plugin.AbstractPluginLoader.create(AbstractPluginLoader.java:89)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
    at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:392)
    at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:86)
    at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
    at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:373)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:121)
    at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:1018)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1051)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:634)
    at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:680)

- Hayden
PositionLengthAttribute - Does it do anything at all?
I've been playing around with PositionLengthAttribute for a few days, and it doesn't seem to have any effect at all. I'm aware that position length is not stored in the index, as explained in this blog post:

http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

However, even when used at query time it doesn't seem to do anything. Let's take the following token stream as an example.

text: he      posInc: 1  posLen: 1
text: cannot  posInc: 1  posLen: 2
text: can     posInc: 0  posLen: 1
text: not     posInc: 1  posLen: 1
text: help    posInc: 1  posLen: 1

If we were to construct this graph of tokens, it should match the phrases "he can not help" and "he cannot help". According to my testing, it instead matches the phrases "he can not help" and "he cannot not help", because the position length is entirely ignored and treated as if it were always 1. Am I misunderstanding how these attributes work?

- Hayden
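The intended semantics can be modeled in a few lines. This toy path-enumerator walks the token graph assuming posLen is honored, which is exactly the behavior the post says does not happen in practice:

```python
def phrases(token_stream):
    """Enumerate the phrases encoded by a token graph.

    Each token is (text, posInc, posLen): posInc places the token's start
    position relative to the previous token, posLen says how many
    positions it spans, i.e. where its edge ends."""
    placed, pos = [], -1
    for text, pos_inc, pos_len in token_stream:
        pos += pos_inc
        placed.append((pos, pos + pos_len, text))  # (start, end, text)
    end = max(stop for _, stop, _ in placed)

    results = []
    def walk(at, path):
        # Follow every token whose edge starts at the current position.
        if at == end:
            results.append(" ".join(path))
            return
        for start, stop, text in placed:
            if start == at:
                walk(stop, path + [text])
    walk(0, [])
    return sorted(results)
```

Feeding it the stream above yields "he can not help" and "he cannot help"; flattening every posLen to 1 instead produces the "he cannot not help" artifact described in the post.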
Re: What to expect when testing Japanese search index
A search for a single character will only return hits if that character makes up a whole word, and only if the tokenizer recognizes that character as a word. It's just like in other languages, where a search for p won't return documents containing the word apple.

If I were you, I would go into the Solr admin UI and start playing around with the analysis tool. You can paste a phrase in there and it will show you what tokens that phrase will be broken into. I think that will give you a better understanding of why you are getting these search results.

You also don't mention which version of Solr you are using. Can you also include the definition of your text_ja field type?

- Hayden

On Thu, Mar 21, 2013 at 7:01 AM, Van Tassell, Kristian kristian.vantass...@siemens.com wrote:

I'm trying to set up our search index to handle Japanese data, and while some searches yield results, others do not. This is especially true the smaller the search term. For example, searching for this term:

更

yields no results, even though I know it appears in the text. I understand that this character alone may not be a full word without further context, and thus perhaps it should not return a hit(?). What about putting a star after it?

更*

Should that return hits? I had been using the text_ja boilerplate setup, but wonder if a bigram approach (text_cjk) may work better for my non-Japanese-speaking testing phase. Thanks in advance for any insight!
Global .properties file for all Solr cores?
I've read the documentation about how you can configure a Solr core with a properties file. Is there any way to specify a properties file that will apply to all cores running on a server?

Here's my scenario. I have a Solr setup with two cores, foo and bar. I want to enable replication using properties, as is suggested on the wiki:

http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node

I would like my master/slave settings to apply to all cores on a box, but I would still like to have separate solrcore.properties files so that other properties can be set per core. In other words, I would like a setup like this, with three files:

# solr.properties
# These properties should apply to all cores on a box
enable.master=true
enable.slave=false

# foo.solrcore.properties
# These properties only apply to core foo
filterCache.size=16384

# bar.solrcore.properties
# These properties only apply to core bar
filterCache.size=2048

What I'm trying to avoid is having to duplicate the global values across all solrcore.properties files. I've looked into having a .properties file that applies to the whole context, but we are running Tomcat, which does not make this easy. It seems the only way to do this with Tomcat is with the CATALINA_OPTS environment variable, and I would rather duplicate values across solrcore.properties files than use CATALINA_OPTS.

- Hayden
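P.S. For reference, the CATALINA_OPTS route I'd rather avoid would look roughly like this (a sketch; the file name and placement depend on the Tomcat setup, and it relies on Solr's ${property:default} substitution falling back to JVM system properties):

```shell
# Hypothetical Tomcat startup snippet: push the global settings to every
# core as JVM system properties, so any core's solrconfig.xml can resolve
# ${enable.master:false} / ${enable.slave:true} without per-core files.
CATALINA_OPTS="$CATALINA_OPTS -Denable.master=true -Denable.slave=false"
export CATALINA_OPTS
```

The per-core filterCache.size values would still live in each solrcore.properties, which is the split I'm after.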