RE: search with wildcard
I know it's documented that Lucene/Solr doesn't apply filters to queries with wildcards, but this seems to trip up a lot of users. I can also see why wildcards break a number of filters, but some filters (e.g. mapping charsets) could mostly or entirely work. The N-gram filter is another one that would be great to still run when there are wildcards. If you indexed 4-grams and the query is *testp*, you currently won't get any results; but the N-gram filter could have a wildcard mode that, in this case, would return just the first 4-gram as a token. Is this something you've considered? It would have to be supported in the core, but disabled by default; then it could be enabled one-by-one for existing filters.

Apologies if the dev list is a better place for this.

Scott

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Thursday, November 21, 2013 8:40 AM
To: solr-user@lucene.apache.org
Subject: Re: search with wildcard

Hi Andreas,

If you don't want to use wildcards at query time, an alternative is to use N-grams at indexing time. This will produce a lot of tokens. For example, the 4-grams of your example:

Supertestplan = supe uper pert erte rtes *test* estp stpl tpla plan

Is that what you want? By the way, why do you want to search inside of words?

<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="4"/>

On Thursday, November 21, 2013 5:23 PM, Andreas Owen <a...@conx.ch> wrote:

I suppose I have to create another field with different tokenizers and set the boost very low so it doesn't really mess with my ranking, because then the word is in 2 fields. What kind of tokenizer can do the job?

From: Andreas Owen [mailto:a...@conx.ch]
Sent: Donnerstag, 21. November 2013 16:13
To: solr-user@lucene.apache.org
Subject: search with wildcard

I am querying "test" in Solr 4.3.1 over the field below and it's not finding all occurrences.
It seems that if it is a substring of a word like "Supertestplan" it isn't found unless I use wildcards: *test*. This is right because of my tokenizer, but does someone know a way around this? I don't want to add wildcards because that messes up queries with multiple words.

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- remove common words -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <!-- remove noun/adjective inflections like plural endings -->
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>
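Ahmet's 4-gram expansion above can be sketched in a few lines of plain Java. This is only an illustration of what an n-gram token filter produces at index time, not Solr's actual NGramFilterFactory code:

```java
import java.util.ArrayList;
import java.util.List;

public class NGramDemo {
    // Produce all n-grams of a fixed size from a token, the way an
    // n-gram token filter would at index time.
    static List<String> ngrams(String token, int size) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= token.length(); i++) {
            out.add(token.substring(i, i + size));
        }
        return out;
    }

    public static void main(String[] args) {
        // 4-grams of "supertestplan" include "test", so a query for
        // "test" matches without any wildcard.
        System.out.println(ngrams("supertestplan", 4));
    }
}
```

Because "test" is emitted as one of the tokens, the exact query matches directly; it is the in-between query *testp* that still fails, as discussed at the top of this thread.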
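The second-field approach Andreas asks about (different analysis, weighted low) is usually done with a copyField into an n-gram field. A sketch with hypothetical field names; the boost itself would be applied at query time, e.g. via edismax qf=text^10 text_ngram^0.1:

```xml
<!-- Hypothetical companion field analyzed into 3-4 grams for substring matching -->
<fieldType name="text_de_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="4"/>
  </analyzer>
</fieldType>

<field name="text_ngram" type="text_de_ngram" indexed="true" stored="false"/>
<copyField source="text" dest="text_ngram"/>
```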
RE: fq efficiency
Thanks, that link is very helpful, especially the section "Leapfrog, anyone?"

This actually seems quite slow for my use case. Suppose we have 10,000 users and 1,000,000 documents. We search for "hello" for a particular user, and let's assume that the fq set for the user is cached. "hello" is a common word and perhaps 10,000 documents will match. If the user has 100 documents, then finding the intersection requires checking each list ~100 times. If the user has 1,000 documents, we check each list ~1,000 times. That doesn't scale well.

My searches are usually in one user's data. How can I take advantage of that? I could have a separate index for each user, but loading so many indexes at once seems infeasible, and dynamically loading/unloading indexes is a pain. Or I could create a filter that takes tokens and prepends them with the user id. That seems like a good solution, since my keyword searches always include a user id (and usually just 1 user id). Though I wonder if there is a downside I haven't thought of.

Thanks,
Scott

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Tuesday, November 05, 2013 4:35 PM
To: solr-user@lucene.apache.org
Subject: Re: fq efficiency

On 11/5/2013 3:36 PM, Scott Schneider wrote:

I'm wondering if filter queries are efficient enough for my use cases. I have lots and lots of users in a big, multi-tenant, sharded index. To run a search, I can use an fq on the user id and pass in the search terms. Does this scale well with the # users? I suppose that, since user id is indexed, generating the filter data (which is cached) will be fast. And looking up search terms is fast, of course. But if the search term is a common one that many users have in their documents, then Solr may have to perform an intersection between two large sets: docs from all users with the search term and all of the current user's docs.

Also, how about auto-complete and searching with a trailing wildcard?
As I understand it, these work well in a single-tenant index because keywords are sorted in the index, so it's easy to get all the search terms that match foo*. In a multi-tenant index, all users' keywords are stored together. So if Lucene were to look at all the keywords from "foo" to "fooz" (I'm not sure if it actually does this), it would skip over a large majority of keywords that don't belong to this user.

From what I understand, there's not really a whole lot of difference between queries and filter queries when they are NOT cached, except that the main query and the filter queries are executed in parallel, which can save time.

When filter queries are found in the filterCache, it's a different story. They get applied *before* the main query, which means that the main query won't have to work as hard. The filterCache stores information about which documents in the entire index match the filter. By storing it as a bitset, the amount of space required is relatively low. Applying filterCache results is very efficient.

There are also advanced techniques, like assigning a cost to each filter and creating postfilters: http://yonik.com/posts/advanced-filter-caching-in-solr/

Thanks,
Shawn
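The "leapfrog" intersection that Scott is worried about can be sketched over two sorted doc-id lists. This is a simplified illustration of the idea, not Lucene's actual implementation (real code skips ahead in blocks via advance() rather than one id at a time):

```java
import java.util.ArrayList;
import java.util.List;

public class Leapfrog {
    // Intersect two sorted doc-id lists by advancing whichever side
    // is behind -- each side "leapfrogs" past the other.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;  // query list is behind; advance it
            } else {
                j++;  // filter list is behind; advance it
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] queryDocs  = {1, 4, 7, 9, 12};  // docs matching "hello"
        int[] filterDocs = {4, 9, 10, 12};    // docs owned by the user
        System.out.println(intersect(queryDocs, filterDocs)); // [4, 9, 12]
    }
}
```

The cost is proportional to the smaller list's length times the skip cost, which is exactly why Scott's "10,000 matches of a common word vs. one user's documents" case feels expensive.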
RE: fq efficiency
Digging a bit more, I think I have answered my own questions. Can someone please say if this sounds right?

http://wiki.apache.org/solr/LotsOfCores looks like a pretty good solution. If I give each user his own shard, each query can be run in only one shard. The effect of the filter query will basically be to find that shard. The requirements listed on the wiki suggest that performance will be good. But in Solr 3.x, this won't scale with the # users/shards.

Prepending a user id to indexed keywords using an analyzer will break wildcard search. If there is a wildcard, the query analyzer doesn't run filters, so it won't prepend the user id. I could prepend the user id myself before calling Solr, but that seems... bad.

Scott

-Original Message-
From: Scott Schneider [mailto:scott_schnei...@symantec.com]
Sent: Thursday, November 07, 2013 2:03 PM
To: solr-user@lucene.apache.org
Subject: RE: fq efficiency
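Scott's idea of prepending the user id to every token can be shown in plain Java. A real version would be a Lucene TokenFilter rewriting the term attribute; this sketch shows only the string transformation, with a hypothetical "|" delimiter:

```java
import java.util.List;
import java.util.stream.Collectors;

public class UserPrefixDemo {
    // Prepend a user id to each token, so "hello" indexed for user 42
    // becomes "42|hello" and never collides with other users' terms.
    static List<String> prefixTokens(String userId, List<String> tokens) {
        return tokens.stream()
                     .map(t -> userId + "|" + t)
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(prefixTokens("42", List.of("hello", "world")));
        // Caveat from this thread: wildcard terms bypass query-time
        // analysis, so for a query like foo* the prefix would have to
        // be added on the client side.
    }
}
```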
fq efficiency
Hi all,

I'm wondering if filter queries are efficient enough for my use cases. I have lots and lots of users in a big, multi-tenant, sharded index. To run a search, I can use an fq on the user id and pass in the search terms. Does this scale well with the # users? I suppose that, since user id is indexed, generating the filter data (which is cached) will be fast. And looking up search terms is fast, of course. But if the search term is a common one that many users have in their documents, then Solr may have to perform an intersection between two large sets: docs from all users with the search term and all of the current user's docs.

Also, how about auto-complete and searching with a trailing wildcard? As I understand it, these work well in a single-tenant index because keywords are sorted in the index, so it's easy to get all the search terms that match foo*. In a multi-tenant index, all users' keywords are stored together. So if Lucene were to look at all the keywords from "foo" to "fooz" (I'm not sure if it actually does this), it would skip over a large majority of keywords that don't belong to this user.

Thanks,
Scott
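Concretely, the fq approach described above looks like this as a request (hypothetical collection and field names); the fq clause is cached in the filterCache independently of the q term, so the per-user doc set is computed once and reused:

```
http://localhost:8983/solr/collection1/select?q=hello&fq=user_id:12345
```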
RE: Problem loading my codec sometimes
Ok, I created SOLR-5278. Thanks again!

Scott

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Wednesday, September 25, 2013 10:15 AM
To: solr-user@lucene.apache.org
Subject: RE: Problem loading my codec sometimes

: Ah, I fixed it. I wasn't properly including the
: org.apache.lucene.codecs.Codec file in my jar. I wasn't sure if it was
: necessary in Solr, since I specify my factory in solrconfig.xml. I
: think that's why I could create a new index, but not load an existing
: one.

Ah, interesting. Yes, you definitely need the SPI registration in the jar file so that it can resolve codec files found on disk when opening them -- the configuration in solrconfig.xml tells Solr which codec to use when writing new segments, but it must respect the codec information in segments found on disk when opening them (that's how the index back-compat works), and those are looked up via SPI.

Can you do me a favor please and still file an issue with these details. The attachments I asked about before would still be handy, but probably not necessary -- at a minimum could you show us the "jar tf" output of your plugin jar when you were having the problem. Even if the codec factory code can find the configured codec on startup, we should probably throw a very loud error right away if that same codec can't be found by name using SPI, to prevent people from running into confusing problems when making mistakes like this.

-Hoss
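For reference, the SPI registration Hoss describes is a plain-text services file inside the plugin jar; the file name is the fully qualified interface name, and each line names an implementation class. A sketch assuming a hypothetical codec class com.example.MyCodec:

```
# File (inside the jar): META-INF/services/org.apache.lucene.codecs.Codec
com.example.MyCodec
```

Java's ServiceLoader reads this file to map the codec name recorded in each segment back to a class, which is why omitting it breaks reopening an existing index while fresh writes still work.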
RE: Problem loading my codec sometimes
Thanks for your quick response! My jar was in solr/lib. I removed all the lib directives from solrconfig.xml, but I still get the error. My solr.xml doesn't have sharedLib. By the way, I am running Solr 4.4.0 with most of the default example files (including solr.xml). My schema.xml and solrconfig.xml are from another project using Solr 3.6. I modified them a bit to fix any obvious errors.

I still wonder why it can create a new index using my codec, but not load an index previously created with my codec. In solrconfig.xml, I specify the CodecFactory along with the package name, whereas the codec name that is read from the index file has no package name. Could that be the problem? I think that's the way it's supposed to be. Could it be that Solr has my jar in the classpath, but SPI is not registering my codec class from the jar? I'm not familiar with SPI. What else can I try?

Thanks,
Scott

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Tuesday, September 24, 2013 5:51 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem loading my codec sometimes

On 9/24/2013 6:32 PM, Scott Schneider wrote:

I created my own codec and Solr can find it sometimes and not other times. When I start fresh (delete the data folder and run Solr), it all works fine. I can add data and query it. When I stop Solr and start it again, I get:

Caused by: java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.Codec with name 'MyCodec' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath. The current classpath supports the following names: [SimpleText, Appending, Lucene40, Lucene3x, Lucene41, Lucene42]

I added the JAR to the path and I'm pretty sure Java sees it, or else it would not be using my codec when I start fresh. (I've looked at the index files and verified that it's using my codec.)
I suppose Solr is asking SPI for my codec based on the codec class name stored in the index files, but I don't see why this would fail when a fresh start works.

What I always recommend for those who want to use custom and contrib jars is that they put all such jars (and their dependencies) into ${solr.solr.home}/lib, don't use any lib directives in solrconfig.xml, and don't put the sharedLib attribute into solr.xml. Doing it in any other way has a tendency to trigger bugs or causes jars to get loaded more than once.

The ${solr.solr.home} property defaults to $CWD/solr (CWD is current working directory, for those who don't already know) and is the location of the solr.xml file. Note that depending on the exact version of Solr and which servlet container you are using, there may actually be two solr.xml files, one which loads Solr into your container and one that configures Solr. I am referring to the latter.

If you are using the solr example and its directory layout, the directory you would need to put all jars into is example/solr/lib ... which is a directory that doesn't exist and has to be created.

http://wiki.apache.org/solr/Solr.xml%20%28supported%20through%204.x%29
http://wiki.apache.org/solr/Solr.xml%204.4%20and%20beyond

Thanks,
Shawn
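With the example layout Shawn describes, the jar placement would look something like this (the jar name is hypothetical):

```
example/solr/            <- ${solr.solr.home}; contains solr.xml
  solr.xml
  lib/                   <- create this directory; put mycodec.jar here
    mycodec.jar
  collection1/
    conf/
      schema.xml
      solrconfig.xml
```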
RE: Problem loading my codec sometimes
Ah, I fixed it. I wasn't properly including the org.apache.lucene.codecs.Codec file in my jar. I wasn't sure if it was necessary in Solr, since I specify my factory in solrconfig.xml. I think that's why I could create a new index, but not load an existing one.

Scott

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Wednesday, September 25, 2013 9:49 AM
To: solr-user@lucene.apache.org
Subject: RE: Problem loading my codec sometimes

: I still wonder why it can create a new index using my codec, but not
: load an index previously created with my codec. In solrconfig.xml, I
: specify the CodecFactory along with the package name, whereas the codec
: name that is read from the index file has no package name. Could that
: be the problem? I think that's the way it's supposed to be. Could it
: be that Solr has my jar in the classpath, but SPI is not registering my
: codec class from the jar? I'm not familiar with SPI.

It's very possible that there is a classloader / SPI runtime race condition in looking up the codec names found in segment files. This sort of classpath-related runtime issue is extremely hard to write tests for. Could you please file a bug and include...

* the source of your codec (or a simple sample codec that you can also use to reproduce the problem)
* a zipped-up copy of your entire Solr home directory, including the jar file containing your codec so we can verify the SPI files are in there properly - no need to include an actual index here
* some simple sample documents in XML or JSON that we can index with the schema you are using

-Hoss
RE: Querying a non-indexed field?
Ok, thanks for your answers!

Scott

-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
Sent: Wednesday, September 18, 2013 5:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Querying a non-indexed field?

Moreover, you may be trying to save/optimize in the wrong place. Maybe these additional indexed fields are not so costly. Maybe you can optimize in some other part of your setup.

Otis
Solr & ElasticSearch Support
http://sematext.com/

On Sep 18, 2013 5:47 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:

: Subject: Re: Querying a non-indexed field?
:
: No. --wunder

To elaborate just a bit...

: query on a few indexed fields, getting a small # of results. I want to
: restrict this further based on values from non-indexed, stored fields.
: I can obviously do this myself, but it would be nice if Solr could do

...you could implement this in a custom SearchComponent, or a custom qparser that would generate PostFilter-compatible queries, that looked at the stored field values -- but it's extremely unlikely that you would ever convince any of the Lucene/Solr devs to agree to commit a general-purpose version of this type of logic into the code base -- because in the general case (an arbitrary, unknown number of documents matching the main query) it would be extremely inefficient and would encourage bad user behavior.

-Hoss
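Absent such a custom PostFilter, the "do this myself" option mentioned in the question amounts to filtering the returned documents on the client by their stored field values. A plain-Java sketch with hypothetical field names:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class StoredFieldFilter {
    // Given documents already returned by Solr (as field -> value maps),
    // keep only those whose stored field matches the wanted value.
    static List<Map<String, Object>> filterByStored(
            List<Map<String, Object>> docs, String field, Object wanted) {
        return docs.stream()
                   .filter(d -> wanted.equals(d.get(field)))
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = List.of(
            Map.of("id", "1", "status", "active"),
            Map.of("id", "2", "status", "archived"));
        System.out.println(filterByStored(docs, "status", "active"));
    }
}
```

This only stays cheap because, as the question says, the indexed-field query already narrows the results to a small set; for large result sets it has exactly the inefficiency Hoss warns about.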
Problem loading my codec sometimes
Hello,

I created my own codec and Solr can find it sometimes and not other times. When I start fresh (delete the data folder and run Solr), it all works fine. I can add data and query it. When I stop Solr and start it again, I get:

Caused by: java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.Codec with name 'MyCodec' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath. The current classpath supports the following names: [SimpleText, Appending, Lucene40, Lucene3x, Lucene41, Lucene42]

I added the JAR to the path and I'm pretty sure Java sees it, or else it would not be using my codec when I start fresh. (I've looked at the index files and verified that it's using my codec.) I suppose Solr is asking SPI for my codec based on the codec class name stored in the index files, but I don't see why this would fail when a fresh start works. Any thoughts?

Thanks,
Scott
Querying a non-indexed field?
Hello,

Is it possible to restrict query results using a non-indexed, stored field? (e.g. I might index fewer fields to reduce the index size.) I query on a few indexed fields, getting a small # of results. I want to restrict this further based on values from non-indexed, stored fields. I can obviously do this myself, but it would be nice if Solr could do this for me.

Thanks,
Scott
Solr substring search
Hello,

I'm trying to find out how Solr runs a query for *foo*. Google tells me that you need to use NGramFilterFactory for that kind of substring search, but I find that even with very simple fieldTypes, it just works. (Perhaps because I'm testing on very small data sets, Solr is willing to look through all the keywords.) e.g. This works on the tutorial.

Can someone tell me exactly how this works and/or point me to the Lucene code that implements this?

Thanks,
Scott
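For what it's worth, a wildcard query like *foo* is answered by enumerating indexed terms and matching each against the pattern (in Lucene this lives around WildcardQuery/MultiTermQuery). A much-simplified plain-Java sketch of why it "just works" on small data, and why a leading wildcard is costly:

```java
import java.util.List;
import java.util.stream.Collectors;

public class WildcardScan {
    // A leading wildcard cannot seek in the sorted term dictionary, so
    // every term must be checked; foo* by contrast could seek to "foo"
    // and stop once terms no longer start with it.
    static List<String> matchContains(List<String> sortedTerms, String needle) {
        return sortedTerms.stream()
                          .filter(t -> t.contains(needle))  // *foo* => substring match
                          .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> terms = List.of("apple", "foobar", "seafood", "zoo");
        System.out.println(matchContains(terms, "foo")); // [foobar, seafood]
    }
}
```

On a tutorial-sized index this full scan is imperceptible; n-grams trade index size for turning the same match into a direct term lookup.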