Re: SolrCloud: Meaning of SYNC state in ZkStateReader?
Ok, thanks for your response, Mark! Cheers, Martin On Tue, Oct 14, 2014 at 1:59 AM, Mark Miller markrmil...@gmail.com wrote: I think it's just cruft I left in and never ended up using anywhere. You can ignore it. - Mark On Oct 13, 2014, at 8:42 PM, Martin Grotzke martin.grot...@googlemail.com wrote: Hi, can anybody tell me the meaning of ZkStateReader.SYNC? All other state related constants are clear to me, I'm only not sure about the semantics of SYNC. Background: I'm working on an async solr client (https://github.com/inoio/solrs) and want to add SolrCloud support - for this I'm reusing ZkStateReader. TIA cheers, Martin -- Martin Grotzke http://twitter.com/martin_grotzke
SolrCloud: Meaning of SYNC state in ZkStateReader?
Hi, can anybody tell me the meaning of ZkStateReader.SYNC? All other state related constants are clear to me, I'm only not sure about the semantics of SYNC. Background: I'm working on an async solr client (https://github.com/inoio/solrs) and want to add SolrCloud support - for this I'm reusing ZkStateReader. TIA cheers, Martin
LBHttpSolrServer to query a preferred server
Hi, we want to use the LBHttpSolrServer (4.0/trunk) and specify a preferred server. Our use case is that for one user request we make several solr requests with some heavy caching (using a custom request handler with a special cache) and want to make sure that the subsequent solr requests are hitting the same solr server. A possible solution with LBHttpSolrServer would look like this:

- LBHttpSolrServer provides a method getSolrServer() that returns a ServerWrapper
- LBHttpSolrServer provides a method request(final SolrRequest request, ServerWrapper preferredServer) that returns the response (NamedList<Object>). This method first tries the specified preferredServer and if this fails queries all others (first alive servers then zombies).

What do you think of this solution? Any other solution preferred? I'll start implementing this and submit an issue/patch hoping that it makes it into trunk. Cheers, Martin
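A minimal sketch of how the proposed usage could look - purely illustrative: the two methods and the visibility of ServerWrapper are the proposal from this mail and not part of the released LBHttpSolrServer, and the server URLs are made up:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.LBHttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.util.NamedList;

public class PreferredServerSketch {
    public static void main(String[] args) throws Exception {
        LBHttpSolrServer lb = new LBHttpSolrServer(
                "http://solr1:8983/solr", "http://solr2:8983/solr");

        // Pick one server at the start of the user request ...
        LBHttpSolrServer.ServerWrapper preferred = lb.getSolrServer();

        // ... and send all subsequent solr requests of this user request to it.
        // Only if it fails, the other servers are tried (alive servers first, then zombies).
        NamedList<Object> rsp = lb.request(new QueryRequest(new SolrQuery("foo")), preferred);
        System.out.println(rsp);
    }
}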
Re: LBHttpSolrServer to query a preferred server
Hi, I just submitted an issue with patch for this: https://issues.apache.org/jira/browse/SOLR-3318 Cheers, Martin On 04/04/2012 03:53 PM, Martin Grotzke wrote: Hi, we want to use the LBHttpSolrServer (4.0/trunk) and specify a preferred server. Our use case is that for one user request we make several solr requests with some heavy caching (using a custom request handler with a special cache) and want to make sure that the subsequent solr requests are hitting the same solr server. A possible solution with LBHttpSolrServer would look like this: - LBHttpSolrServer provides a method getSolrServer() that returns a ServerWrapper - LBHttpSolrServer provides a method request(final SolrRequest request, ServerWrapper preferredServer) that returns the response (NamedList<Object>). This method first tries the specified preferredServer and if this fails queries all others (first alive servers then zombies). What do you think of this solution? Any other solution preferred? I'll start implementing this and submit an issue/patch hoping that it makes it into trunk. Cheers, Martin
How to determine memory consumption per core
Hi, is it possible to determine the memory consumption (heap space) per core in solr trunk (4.0-SNAPSHOT)? I just unloaded a core and saw the difference in memory usage, but it would be nice to have a smoother way of getting the information without core downtime. It would also be interesting, which caches are the biggest ones, to know which one should/might be reduced. Thanx cheers, Martin
Re: AW: How to deal with many files using solr external file field
Hi, as I'm also involved in this issue (on the side of Sven) I created a patch that replaces the float array by a map that stores score by doc, so it contains as many entries as the external scoring file contains lines, but no more. I created an issue for this: https://issues.apache.org/jira/browse/SOLR-2583 It would be great if someone could have a look at it and comment. Thanx for your feedback, cheers, Martin

On 06/08/2011 12:22 PM, Bohnsack, Sven wrote:
Hi, I could not provide a stack trace and IMHO it won't provide some useful information. But we've made good progress in the analysis. We took a deeper look at what happens when an external-file-field request is sent to SOLR:
* SOLR looks if there is a file for the requested query, e.g. trousers
* If so, then SOLR loads the trousers-file and generates a HashMap entry consisting of a FileFloatSource object and a float array with the size of the number of documents in the SOLR index. Every document matched by the query gains the score value which is provided in the external score file. For every(!) other document SOLR writes a zero in that float array
* if SOLR does not find a file for the query request, then SOLR still generates a HashMap entry with score zero for every document
In our case we have about 8.5 million documents in our index and one of those arrays occupies about 34MB heap space. Having e.g. 100 different queries and using external file field for sorting the result, SOLR occupies about 3.4GB of heap space. The problem might be the use of WeakHashMap [1], which prevents the garbage collector from cleaning up unused keys. What do you think could be a possible solution for this whole problem? (except for "don't use external file fields" ;) Regards Sven
[1]: A hashtable-based Map implementation with weak keys. An entry in a WeakHashMap will automatically be removed when its key is no longer in ordinary use. More precisely, the presence of a mapping for a given key will not prevent the key from being discarded by the garbage collector, that is, made finalizable, finalized, and then reclaimed. When a key has been discarded its entry is effectively removed from the map, so this class behaves somewhat differently than other Map implementations.

-----Original Message-----
From: mtnes...@gmail.com [mailto:mtnes...@gmail.com] On Behalf Of Simon Rosenthal
Sent: Wednesday, June 8, 2011 03:56
To: solr-user@lucene.apache.org
Subject: Re: How to deal with many files using solr external file field

Can you provide a stack trace for the OOM exception? On Tue, Jun 7, 2011 at 4:25 PM, Bohnsack, Sven sven.bohns...@shopping24.de wrote: Hi all, we're using solr 1.4 and external file field ([1]) for sorting our search results. We have about 40.000 terms, for which we use this sorting option. Currently we're running into massive OutOfMemory problems and we're not quite sure what's the matter. It seems that the garbage collector stops working or some processes are going wild. However, solr starts to allocate more and more RAM until we experience this OutOfMemory exception. We noticed the following: For some terms one could see in the solr log that there appear some java.io.FileNotFoundExceptions when solr tries to load an external file for a term for which there is no such file, e.g. solr tries to load the external score file for trousers but there is none in the /solr/data folder. Question: is it possible that those exceptions are responsible for the OutOfMemory problem, or could it be due to the large(?) number of 40k terms for which we want to sort the result via external file field? I'm looking forward to your answers, suggestions and ideas :) Regards Sven [1]: http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html -- Martin Grotzke http://twitter.com/martin_grotzke
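For illustration, the core idea behind the patch in SOLR-2583 can be sketched like this (a hedged sketch of the data structure change, not the actual patch code): instead of allocating a float[maxDoc] per external file, keep scores only for the documents that actually appear in the file and fall back to a default for all others.

import java.util.HashMap;
import java.util.Map;

public class SparseExternalScores {
    // Sketch only: score lookup backed by a sparse map instead of a dense float array.
    private final Map<Integer, Float> scoreByDoc = new HashMap<Integer, Float>();
    private final float defaultScore;

    public SparseExternalScores(float defaultScore) {
        this.defaultScore = defaultScore;
    }

    // Called while parsing a line of the external score file.
    public void put(int docId, float score) {
        scoreByDoc.put(docId, score);
    }

    // Memory usage is proportional to the number of lines in the file,
    // not to the number of documents in the index.
    public float getScore(int docId) {
        Float score = scoreByDoc.get(docId);
        return score != null ? score.floatValue() : defaultScore;
    }
}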
Solrj retry handling - prevent ProtocolException: Unbuffered entity enclosing request can not be repeated
Hi, from time to time we're seeing a "ProtocolException: Unbuffered entity enclosing request can not be repeated." in the logs when sending ~500 docs to solr (the stack trace is at the end of the email). I'm aware that this was discussed before (e.g. [1]) and our solution was already to reduce the number of docs that are sent to solr. However, I think that the issue might be solved in solrj. This discussion on the httpclient-dev mailing list [2] points out the solution under option 3) "re-instantiate the input stream and retry the request manually". AFAICS CommonsHttpSolrServer.request when _maxRetries is set to s.th. > 0 (see [3]) already does some retry stuff, but not around the actual http method execution (_httpClient.executeMethod(method)). Not sure for what the several tries are implemented, but I'd say that if the user sets maxRetries to s.th. > 0 also http method execution should be retried. Another thing is the actually seen ProtocolException: AFAICS this is thrown as httpclient (HttpMethodDirector.executeWithRetry) performs a retry itself (see [4]) while the actually processed HttpMethod does not support this. As HttpMethodDirector.executeWithRetry already checks for a HttpMethodRetryHandler (under param HttpMethodParams.RETRY_HANDLER, [5]), it seems as if it would be enough to add such a handler for the update/POST requests to prevent the ProtocolException. So in summary I suggest two things:
1) Retry http method execution when maxRetries is > 0
2) Prevent HttpClient from doing retries (by adding a HttpMethodRetryHandler)
I first wanted to post it here on the list to see if there are objections or other solutions. Or if there are plans to replace commons httpclient (3.x) by s.th. like apache httpclient 4.x or async-http-client. If there's an agreement that the proposed solution is the way to go ATM I'd submit an appropriate issue for this. Any comments? Cheers, Martin
[1] http://lucene.472066.n3.nabble.com/Unbuffered-entity-enclosing-request-can-not-be-repeated-tt788186.html
[2] http://www.mail-archive.com/commons-httpclient-dev@jakarta.apache.org/msg06723.html
[3] http://svn.apache.org/viewvc/lucene/dev/trunk/solr/src/solrj/org/apache/solr/client/solrj/impl/CommonsHttpSolrServer.java?view=markup#l281
[4] http://svn.apache.org/viewvc/httpcomponents/oac.hc3x/trunk/src/java/org/apache/commons/httpclient/HttpMethodDirector.java?view=markup#l366
[5] http://svn.apache.org/viewvc/httpcomponents/oac.hc3x/trunk/src/java/org/apache/commons/httpclient/HttpMethodDirector.java?view=markup#l426
Stack trace:
Caused by: org.apache.commons.httpclient.ProtocolException: Unbuffered entity enclosing request can not be repeated.
        at org.apache.commons.httpclient.methods.EntityEnclosingMethod.writeRequestBody(EntityEnclosingMethod.java:487)
        at org.apache.commons.httpclient.HttpMethodBase.writeRequest(HttpMethodBase.java:2110)
        at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1088)
        at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:427)
-- Martin Grotzke http://twitter.com/martin_grotzke
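A hedged sketch of the two suggestions, written against plain commons-httpclient 3.x rather than the solrj internals (the update URL, the XML body and the idea of rebuilding the RequestEntity per attempt are illustrative assumptions):

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.httpclient.methods.StringRequestEntity;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class UpdateRequestSketch {
    public static void main(String[] args) throws Exception {
        HttpClient httpClient = new HttpClient();
        String url = "http://localhost:8983/solr/update";
        int maxRetries = 3;

        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            PostMethod post = new PostMethod(url);
            // Suggestion 2): disable httpclient's internal retry, it cannot
            // replay an unbuffered (streamed) request body.
            post.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                    new DefaultHttpMethodRetryHandler(0, false));
            // Re-create the request body for every attempt (option 3 of [2]).
            post.setRequestEntity(new StringRequestEntity("<add>...</add>", "text/xml", "UTF-8"));
            try {
                // Suggestion 1): retry around the actual http method execution.
                int status = httpClient.executeMethod(post);
                if (status == 200) {
                    break;
                }
            } catch (java.io.IOException e) {
                if (attempt == maxRetries) {
                    throw e;
                }
            } finally {
                post.releaseConnection();
            }
        }
    }
}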
Re: Use terracotta bigmemory for solr-caches
On Tue, Jan 25, 2011 at 4:19 PM, Em mailformailingli...@yahoo.de wrote: Hi Martin, are you sure that your GC is well tuned? This are the heap related jvm configurations for the servers running with 17GB heap size (one with parallel collector, one with CMS): -XX:+HeapDumpOnOutOfMemoryError -server -Xmx17G -XX:MaxPermSize=256m -XX:NewSize=2G -XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseConcMarkSweepGC -XX:+HeapDumpOnOutOfMemoryError -server -Xmx17G -XX:MaxPermSize=256m -XX:NewSize=2G -XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseParallelOldGC -XX:+UseParallelGC Another heap configuration is running with 8GB max heap, and this search server also has lower peaks in response times. To me it seems that it's just too much memory that gets allocated/collected/compacted. I'm just checking out how far we can reduce cache sizes (and the max heap) without any reduction of response times (and disk I/O). Right now it seems that a reduction of the documentCache size indeed does reduce the hitratio of the cache, but it does not have any negative impact on response times (neither is I/O increased). Therefore I'd follow the path of reducing the cache sizes as far as we can as long as there are no negative impacts and then I'd check again the longest requests and see if they're still caused by full GC cycles. Even then they should be much shorter due to the reduced memory that is collected/compacted. So now I also think, the terracotta bigmemory is not the right solution :-) Cheers, Martin A request that needs more than a minute isn't the standard, even when I consider all the other postings about response-performance... Regards -- Martin Grotzke http://www.javakaffee.de/blog/
Recommendation on RAM-/Cache configuration
Hi, recently we're experiencing OOMEs (GC overhead limit exceeded) in our searches. Therefore I want to get some clarification on heap and cache configuration. This is the situation:
- Solr 1.4.1 running on tomcat 6, Sun JVM 1.6.0_13 64bit
- JVM Heap Params: -Xmx8G -XX:MaxPermSize=256m -XX:NewSize=2G -XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseParallelOldGC -XX:+UseParallelGC
- The machine has 32 GB RAM
- Currently there are 4 processors/cores in the machine, this shall be changed to 2 cores in the future.
- The index size in the filesystem is ~9.5 GB
- The index contains ~ 5.500.000 documents
- 1.500.000 of those docs are available for searches/queries, the rest are inactive docs that are excluded from searches (via a flag/field), but they're still stored in the index as they need to be available by id (solr is the main document store in this app)
- Caches are configured with a big size (the idea was to prevent filesystem access / disk i/o as much as possible):
  - filterCache (solr.LRUCache): size=20, initialSize=3, autowarmCount=1000, actual size =~ 60.000, hitratio =~ 0.99
  - documentCache (solr.LRUCache): size=20, initialSize=10, autowarmCount=0, actual size =~ 160.000 - 190.000, hitratio =~ 0.74
  - queryResultCache (solr.LRUCache): size=20, initialSize=3, autowarmCount=1, actual size =~ 10.000 - 60.000, hitratio =~ 0.71
- Searches are performed using a catchall text field using the standard request handler, all fields are fetched (no fl specified)
- Normally ~ 5 concurrent requests, peaks up to 30 or 40 (mostly during GC)
- Recently we also added a feature that adds weighted search for special fields, so that the query might become s.th. like this: q=(some query) OR name_weighted:(some query)^2.0 OR brand_weighted:(some query)^4.0 OR longDescription_weighted:(some query)^0.5 (it seemed as if this was the cause of the OOMEs, but IMHO it only increased RAM usage so that now GC could not free enough RAM)
The OOMEs that we get are of type GC overhead limit exceeded, one of the OOMEs was thrown during auto-warming. I checked two different heapdumps, the first one autogenerated (by -XX:+HeapDumpOnOutOfMemoryError), the second one generated manually via jmap. These show the following distribution of used memory.
The autogenerated dump:
- documentCache: 56% (size ~ 195.000)
- filterCache: 15% (size ~ 60.000)
- queryResultCache: 8% (size ~ 61.000)
- fieldCache: 6% (fieldCache referenced by WebappClassLoader)
- SolrIndexSearcher: 2%
The manually generated dump:
- documentCache: 48% (size ~ 195.000)
- filterCache: 20% (size ~ 60.000)
- fieldCache: 11% (fieldCache referenced by WebappClassLoader)
- queryResultCache: 7% (size ~ 61.000)
- fieldValueCache: 3%
We are also running two search engines with 17GB heap, these don't run into OOMEs. Though, with these bigger heap sizes the longest requests are even longer due to longer stop-the-world gc cycles. Therefore my goal is to run with a smaller heap, IMHO even smaller than 8GB would be good to reduce the time needed for full gc. So what's the right path to follow now? What would you recommend to change in the configuration (solr/jvm)? Would you say it is ok to reduce the cache sizes? Would this increase disk i/o, or would the index be held in the OS's disk cache? Do you have other recommendations to follow / questions? Thanx cheers, Martin
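For reference, the caches discussed above live in solrconfig.xml and reducing them means lowering these attributes (a sketch only - the numbers below are illustrative placeholders, not a recommendation for this particular index):

<!-- sketch: smaller cache sizes to experiment with, values are illustrative only -->
<filterCache      class="solr.LRUCache" size="16384" initialSize="512" autowarmCount="256"/>
<queryResultCache class="solr.LRUCache" size="16384" initialSize="512" autowarmCount="256"/>
<documentCache    class="solr.LRUCache" size="16384" initialSize="512" autowarmCount="0"/>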
Use terracotta bigmemory for solr-caches
Hi, as the biggest parts of our jvm heap are used by solr caches I asked myself if it wouldn't make sense to run solr caches backed by terracotta's bigmemory (http://www.terracotta.org/bigmemory). The goal is to reduce the time needed for full / stop-the-world GC cycles, as with our 8GB heap the longest requests take up to several minutes. What do you think? Cheers, Martin
Re: Recommendation on RAM-/Cache configuration
On Tue, Jan 25, 2011 at 2:06 PM, Markus Jelsma markus.jel...@openindex.iowrote: On Tuesday 25 January 2011 11:54:55 Martin Grotzke wrote: Hi, recently we're experiencing OOMEs (GC overhead limit exceeded) in our searches. Therefore I want to get some clarification on heap and cache configuration. This is the situation: - Solr 1.4.1 running on tomcat 6, Sun JVM 1.6.0_13 64bit - JVM Heap Params: -Xmx8G -XX:MaxPermSize=256m -XX:NewSize=2G -XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseParallelOldGC -XX:+UseParallelGC Consider switching to HotSpot JVM, use the -server as the first switch. The jvm options I mentioned were not all, we're running the jvm with -server (of course). - The machine has 32 GB RAM - Currently there are 4 processors/cores in the machine, this shall be changed to 2 cores in the future. - The index size in the filesystem is ~9.5 GB - The index contains ~ 5.500.000 documents - 1.500.000 of those docs are available for searches/queries, the rest are inactive docs that are excluded from searches (via a flag/field), but they're still stored in the index as need to be available by id (solr is the main document store in this app) How do you exclude them? It should use filter queries. The docs are indexed with a field findable on which we do a filter query. I also remember (but i just cannot find it back so please correct my if i'm wrong) that in 1.4.x sorting is done before filtering. It should be an improvement if filtering is done before sorting. Hmm, I cannot imagine a case where it makes sense to sort before filtering. Can't believe that solr does it like this. Can anyone shed a light on this? If you use sorting, it takes up a huge amount of RAM if filtering is not done first. - Caches are configured with a big size (the idea was to prevent filesystem access / disk i/o as much as possible): There is only disk I/O if the kernel can't keep the index (or parts) in its page cache. Yes, I'll keep an eye on disk I/O. - filterCache (solr.LRUCache): size=20, initialSize=3, autowarmCount=1000, actual size =~ 60.000, hitratio =~ 0.99 - documentCache (solr.LRUCache): size=20, initialSize=10, autowarmCount=0, actual size =~ 160.000 - 190.000, hitratio =~ 0.74 - queryResultCache (solr.LRUCache): size=20, initialSize=3, autowarmCount=1, actual size =~ 10.000 - 60.000, hitratio =~ 0.71 You should decrease the initialSize values. But your hitratio's seem very nice. Does the initialSize have a real impact? According to http://wiki.apache.org/solr/SolrCaching#initialSize it's the initial size of the HashMap backing the cache. What would you say are reasonable values for size/initialSize/autowarmCount? Cheers, Martin
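For context, the exclusion of inactive docs described above boils down to a request like this (the value true for the findable flag is an assumption):

/select?q=some query&fq=findable:true&rows=10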
Re: Rebuild Spellchecker based on cron expression
On Mon, Dec 13, 2010 at 4:01 AM, Erick Erickson erickerick...@gmail.com wrote: I'm shooting in the dark here, but according to this: http://wiki.apache.org/solr/SolrReplication http://wiki.apache.org/solr/SolrReplicationafter the slave pulls the index down, it issues a commit. So if your slave is configured to generate the dictionary on commit, will it just happen? Our slaves spellcheckers are not configured to buildOnCommit, therefore it shouldn't just happen. But according to this: https://issues.apache.org/jira/browse/SOLR-866 https://issues.apache.org/jira/browse/SOLR-866this is an open issue Thanx for the pointer! SOLR-866 is even better suited for us - after reading SOLR-433 again I realized that it targets scripts based replication (what we're going to leave behind us). Cheers, Martin Best Erick On Sun, Dec 12, 2010 at 8:30 PM, Martin Grotzke martin.grot...@googlemail.com wrote: On Mon, Dec 13, 2010 at 2:12 AM, Markus Jelsma markus.jel...@openindex.io wrote: Maybe you've overlooked the build parameter? http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.build I'm aware of this, but we don't want to maintain cron-jobs on all slaves for all spellcheckers for all cores. That's why I'm thinking about a more integrated solution. Or did I really overlook s.th.? Cheers, Martin Hi, the spellchecker component already provides a buildOnCommit and buildOnOptimize option. Since we have several spellchecker indices building on each commit is not really what we want to do. Building on optimize is not possible as index optimization is done on the master and the slaves don't even run an optimize but only fetch the optimized index. Therefore I'm thinking about an extension of the spellchecker that allows you to rebuild the spellchecker based on a cron-expression (e.g. rebuild each night at 1 am). What do you think about this, is there anybody else interested in this? Regarding the lifecycle, is there already some executor framework or any regularly running process in place, or would I have to pull up my own thread? If so, how can I stop my thread when solr/tomcat is shutdown (I couldn't see any shutdown or destroy method in SearchComponent)? Thanx for your feedback, cheers, Martin -- Martin Grotzke http://twitter.com/martin_grotzke -- Martin Grotzke http://www.javakaffee.de/blog/
Re: Rebuild Spellchecker based on cron expression
Hi Erick, thanx for your advice! I'll check the options with our client and see how we'll proceed. My spare time right now is already full with other open source stuff, otherwise it'd be fun contributing s.th. to solr! :-) Cheers, Martin On Mon, Dec 13, 2010 at 2:46 PM, Erick Erickson erickerick...@gmail.com wrote: *** Just wondering what's the reason that this patch receives that little interest. Anything wrong with it? *** Nobody got behind it and pushed I suspect. And since it's been a long time since it was updated, there's no guarantee that it would apply cleanly any more. Or that it will perform as intended. So, if you're really interested, I'd suggest you ping the dev list and ask whether this is valuable or if it's been superseded. If the feedback is that this would be valuable, you can see what you can do to make it happen. Once it's working to your satisfaction and you've submitted a patch, let people know it's ready and ask them to commit it or critique it. You might have to remind the committers after a few days that it's ready and get it applied to trunk and/or 3.x. But I really wouldn't start working with it until I got some feedback from the people who are actively working on Solr whether it's been superseded by other functionality first, sometimes bugs just aren't closed when something else makes it obsolete. Here's a place to start: http://wiki.apache.org/solr/HowToContribute Best Erick On Mon, Dec 13, 2010 at 2:58 AM, Martin Grotzke martin.grot...@googlemail.com wrote: Hi, when thinking further about it it's clear that https://issues.apache.org/jira/browse/SOLR-433 would be even better - we could generate the spellechecker indices on commit/optimize on the master and replicate them to all slaves. Just wondering what's the reason that this patch receives that little interest. Anything wrong with it? Cheers, Martin On Mon, Dec 13, 2010 at 2:04 AM, Martin Grotzke martin.grot...@googlemail.com wrote: Hi, the spellchecker component already provides a buildOnCommit and buildOnOptimize option. Since we have several spellchecker indices building on each commit is not really what we want to do. Building on optimize is not possible as index optimization is done on the master and the slaves don't even run an optimize but only fetch the optimized index. Therefore I'm thinking about an extension of the spellchecker that allows you to rebuild the spellchecker based on a cron-expression (e.g. rebuild each night at 1 am). What do you think about this, is there anybody else interested in this? Regarding the lifecycle, is there already some executor framework or any regularly running process in place, or would I have to pull up my own thread? If so, how can I stop my thread when solr/tomcat is shutdown (I couldn't see any shutdown or destroy method in SearchComponent)? Thanx for your feedback, cheers, Martin -- Martin Grotzke http://www.javakaffee.de/blog/ -- Martin Grotzke http://www.javakaffee.de/blog/
Rebuild Spellchecker based on cron expression
Hi, the spellchecker component already provides a buildOnCommit and buildOnOptimize option. Since we have several spellchecker indices building on each commit is not really what we want to do. Building on optimize is not possible as index optimization is done on the master and the slaves don't even run an optimize but only fetch the optimized index. Therefore I'm thinking about an extension of the spellchecker that allows you to rebuild the spellchecker based on a cron-expression (e.g. rebuild each night at 1 am). What do you think about this, is there anybody else interested in this? Regarding the lifecycle, is there already some executor framework or any regularly running process in place, or would I have to pull up my own thread? If so, how can I stop my thread when solr/tomcat is shutdown (I couldn't see any shutdown or destroy method in SearchComponent)? Thanx for your feedback, cheers, Martin
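Until something like a cron-expression option exists inside Solr, one way to approximate the nightly rebuild from outside is a small scheduled SolrJ client that triggers spellcheck.build against each slave. A sketch only: the handler name /spellCheckCompRH, the core URL and the fixed 24h period (instead of a real "1 am" cron expression) are assumptions:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.params.ModifiableSolrParams;

public class SpellcheckerRebuildScheduler {
    public static void main(String[] args) throws Exception {
        final SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr/core0");
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Rebuild once a day; computing the initial delay so that this runs at 1 am is left out.
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                try {
                    ModifiableSolrParams params = new ModifiableSolrParams();
                    params.set("qt", "/spellCheckCompRH"); // handler that has the spellcheck component
                    params.set("q", "*:*");
                    params.set("rows", "0");
                    params.set("spellcheck", "true");
                    params.set("spellcheck.build", "true");
                    solr.query(params);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }, 0, 24, TimeUnit.HOURS);
    }
}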
Re: Rebuild Spellchecker based on cron expression
On Mon, Dec 13, 2010 at 2:12 AM, Markus Jelsma markus.jel...@openindex.io wrote: Maybe you've overlooked the build parameter? http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.build I'm aware of this, but we don't want to maintain cron-jobs on all slaves for all spellcheckers for all cores. That's why I'm thinking about a more integrated solution. Or did I really overlook s.th.? Cheers, Martin Hi, the spellchecker component already provides a buildOnCommit and buildOnOptimize option. Since we have several spellchecker indices building on each commit is not really what we want to do. Building on optimize is not possible as index optimization is done on the master and the slaves don't even run an optimize but only fetch the optimized index. Therefore I'm thinking about an extension of the spellchecker that allows you to rebuild the spellchecker based on a cron-expression (e.g. rebuild each night at 1 am). What do you think about this, is there anybody else interested in this? Regarding the lifecycle, is there already some executor framework or any regularly running process in place, or would I have to pull up my own thread? If so, how can I stop my thread when solr/tomcat is shutdown (I couldn't see any shutdown or destroy method in SearchComponent)? Thanx for your feedback, cheers, Martin -- Martin Grotzke http://twitter.com/martin_grotzke
Re: Rebuild Spellchecker based on cron expression
Hi, when thinking further about it it's clear that https://issues.apache.org/jira/browse/SOLR-433 would be even better - we could generate the spellechecker indices on commit/optimize on the master and replicate them to all slaves. Just wondering what's the reason that this patch receives that little interest. Anything wrong with it? Cheers, Martin On Mon, Dec 13, 2010 at 2:04 AM, Martin Grotzke martin.grot...@googlemail.com wrote: Hi, the spellchecker component already provides a buildOnCommit and buildOnOptimize option. Since we have several spellchecker indices building on each commit is not really what we want to do. Building on optimize is not possible as index optimization is done on the master and the slaves don't even run an optimize but only fetch the optimized index. Therefore I'm thinking about an extension of the spellchecker that allows you to rebuild the spellchecker based on a cron-expression (e.g. rebuild each night at 1 am). What do you think about this, is there anybody else interested in this? Regarding the lifecycle, is there already some executor framework or any regularly running process in place, or would I have to pull up my own thread? If so, how can I stop my thread when solr/tomcat is shutdown (I couldn't see any shutdown or destroy method in SearchComponent)? Thanx for your feedback, cheers, Martin -- Martin Grotzke http://www.javakaffee.de/blog/
Re: Multicore and Replication (scripts vs. java, spellchecker)
On Sat, Dec 11, 2010 at 12:38 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : #SOLR-433 MultiCore and SpellChecker replication [1]. Based on the : status of this feature request I'd asume that the normal procedure of : keeping the spellchecker index up2date would be running a cron job on : each node/slave that updates the spellchecker. : Is that right? i'm not 100% certain, but i suspect a lot of people just build the spellcheck dictionaries on the slave machines (redundently) using buildOnCommit http://wiki.apache.org/solr/SpellCheckComponent#Building_on_Commits Ok, also a good option. Though, for us this is not that perfect because we have 4 different spellcheckers configured so that this would eat some cpu that we'd prefer to have left for searching. I think what would be desirable (in our case) is s.th. like rebuilding the spellchecker based on a cron expression, so that we could recreate it e.g. every night at 1 am. When thinking about creating s.th. like this, do you have some advice where I could have a look at in solr? Is there already some framework for running regular tasks, or should I pull up my own Timer/TimerTask etc. and create it from scratch? Cheers, Martin -Hoss -- Martin Grotzke http://www.javakaffee.de/blog/
Re: Multicore and Replication (scripts vs. java, spellchecker)
Hi, that there's no feedback indicates that our plans/preferences are fine. Otherwise it's now a good opportunity to feed back :-) Cheers, Martin On Wed, Dec 8, 2010 at 2:48 PM, Martin Grotzke martin.grot...@googlemail.com wrote: Hi, we're just planning to move from our replicated single index setup to a replicated setup with multiple cores. We're going to start with 2 cores, but the number of cores may change/increase over time. Our replication is still based on scripts/rsync, and I'm wondering if it's worth moving to java based replication. AFAICS the main advantage is simplicity, as with scripts based replication our operations team would have to maintain rsync daemons / cron jobs for each core. Therefore my own preference would be to drop scripts and chose the java based replication. I'd just wanted to ask for experiences with the one or another in a multicore setup. What do you say? Another question is regarding spellchecker replication. I know there's #SOLR-433 MultiCore and SpellChecker replication [1]. Based on the status of this feature request I'd asume that the normal procedure of keeping the spellchecker index up2date would be running a cron job on each node/slave that updates the spellchecker. Is that right? And a final one: are there other things we should be aware of / keep in mind when planning the migration to multiple cores? (Ok, I'm risking to get ask specific questions! as an answer, but perhaps s.o. has interesting, related stories to tell :-)) Thanx in advance, cheers, Martin [1] https://issues.apache.org/jira/browse/SOLR-433 -- Martin Grotzke http://www.javakaffee.de/blog/
Multicore and Replication (scripts vs. java, spellchecker)
Hi, we're just planning to move from our replicated single index setup to a replicated setup with multiple cores. We're going to start with 2 cores, but the number of cores may change/increase over time. Our replication is still based on scripts/rsync, and I'm wondering if it's worth moving to java based replication. AFAICS the main advantage is simplicity, as with scripts based replication our operations team would have to maintain rsync daemons / cron jobs for each core. Therefore my own preference would be to drop scripts and choose the java based replication. I just wanted to ask for experiences with the one or another in a multicore setup. What do you say? Another question is regarding spellchecker replication. I know there's #SOLR-433 MultiCore and SpellChecker replication [1]. Based on the status of this feature request I'd assume that the normal procedure of keeping the spellchecker index up2date would be running a cron job on each node/slave that updates the spellchecker. Is that right? And a final one: are there other things we should be aware of / keep in mind when planning the migration to multiple cores? (Ok, I'm risking to get "ask specific questions!" as an answer, but perhaps s.o. has interesting, related stories to tell :-)) Thanx in advance, cheers, Martin [1] https://issues.apache.org/jira/browse/SOLR-433
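For comparison, the java based (HTTP) replication on a slave core is configured per core in solrconfig.xml along these lines (a sketch: master host, core name and poll interval are placeholders):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- placeholder master URL, one per core -->
    <str name="masterUrl">http://master-host:8983/solr/core0/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>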
Re: ArrayIndexOutOfBoundsException for query with rows=0 and sort param
On Tue, Nov 30, 2010 at 7:51 PM, Martin Grotzke martin.grot...@googlemail.com wrote: On Tue, Nov 30, 2010 at 3:09 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Tue, Nov 30, 2010 at 8:24 AM, Martin Grotzke martin.grot...@googlemail.com wrote: Still I'm wondering, why this issue does not occur with the plain example solr setup with 2 indexed docs. Any explanation? It's an old option you have in your solrconfig.xml that causes a different code path to be followed in Solr: !-- An optimization that attempts to use a filter to satisfy a search. If the requested sort does not include score, then the filterCache will be checked for a filter matching the query. If found, the filter will be used as the source of document ids, and then the sort will be applied to that. -- useFilterForSortedQuerytrue/useFilterForSortedQuery Most apps would be better off commenting that out or setting it to false. It only makes sense when a high number of queries will be duplicated, but with different sorts. Great, this sounds really promising, would be a very easy fix. I need to check this tomorrow on our test/integration server if changing this does the trick for us. I just verified this fix on our test/integration system and it works - cool! Thanx a lot for this hint, cheers, Martin
Re: ArrayIndexOutOfBoundsException for query with rows=0 and sort param
On Tue, Nov 30, 2010 at 10:29 AM, Michael McCandless luc...@mikemccandless.com wrote: Hmm this is in fact a regression. TopFieldCollector expects (but does not verify) that numHits is 0. I guess to fix this we could fix TopFieldCollector.create to return a NullCollector when numHits is 0. Fixing this in lucene/solr sounds good :-) Still I'm wondering, why this issue does not occur with the plain example solr setup with 2 indexed docs. Any explanation? But: why is your app doing this? Ie, if numHits (rows) is 0, the only useful thing you can get is totalHits? Actually I don't know this (yet). Normally our search logic should optimize this and ignore a requested sorting with rows=0, but there seems to be a case that circumvents this - still figuring out. Still I think we should fix it in Lucene -- it's a nuisance to push such corner case checks up into the apps. I'll open an issue... Just for the record, this is https://issues.apache.org/jira/browse/LUCENE-2785 One question: as leaving out sorting leads to better performance, this should also be true for rows=0. Or is lucene/solr already that clever that it makes this optimization (ignoring sort) automatically? Do I understand it correctly, that the solution with the null collector would make this optimiztion? We're just asking ourselves if we should go ahead and analyze and fix this in our app or wait for a patch for solr/lucene. What do you think? Is there s.th. like a timeframe when there's an agreement on the correct solution and a patch available? Thanx cheers, Martin Mike On Mon, Nov 29, 2010 at 7:14 AM, Martin Grotzke martin.grot...@googlemail.com wrote: Hi, after an upgrade from solr-1.3 to 1.4.1 we're getting an ArrayIndexOutOfBoundsException for a query with rows=0 and a sort param specified: java.lang.ArrayIndexOutOfBoundsException: 0 at org.apache.lucene.search.FieldComparator$StringOrdValComparator.copy(FieldComparator.java:660) at org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector.collect(TopFieldCollector.java:84) at org.apache.solr.search.SolrIndexSearcher.sortDocSet(SolrIndexSearcher.java:1391) at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:872) at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341) at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) The query is e.g.: /select/?sort=popularity+descrows=0start=0q=foo When this is changed to rows=1 or when the sort param is removed the exception is gone and everything's fine. With a clean 1.4.1 installation (unzipped, started example and posted two documents as described in the tutorial) this issue is not reproducable. Does anyone have a clue what might be the reason for this and how we could fix this on the solr side? Of course - for a quick fix - I'll change our app so that there's no sort param specified when rows=0. Thanx cheers, Martin -- Martin Grotzke http://twitter.com/martin_grotzke -- Martin Grotzke http://www.javakaffee.de/blog/
Re: ArrayIndexOutOfBoundsException for query with rows=0 and sort param
On Tue, Nov 30, 2010 at 3:09 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Tue, Nov 30, 2010 at 8:24 AM, Martin Grotzke martin.grot...@googlemail.com wrote: Still I'm wondering, why this issue does not occur with the plain example solr setup with 2 indexed docs. Any explanation? It's an old option you have in your solrconfig.xml that causes a different code path to be followed in Solr:

<!-- An optimization that attempts to use a filter to satisfy a search.
     If the requested sort does not include score, then the filterCache
     will be checked for a filter matching the query. If found, the filter
     will be used as the source of document ids, and then the sort will be
     applied to that. -->
<useFilterForSortedQuery>true</useFilterForSortedQuery>

Most apps would be better off commenting that out or setting it to false. It only makes sense when a high number of queries will be duplicated, but with different sorts. Great, this sounds really promising, would be a very easy fix. I need to check this tomorrow on our test/integration server if changing this does the trick for us. Though, I just enabled useFilterForSortedQuery in the solr 1.4.1 example and tested rows=0 with a sort param, which doesn't fail - a correct/valid result is returned. Is there any condition that has to be met additionally to produce the error? One question: as leaving out sorting leads to better performance, this should also be true for rows=0. Or is lucene/solr already that clever that it makes this optimization (ignoring sort) automatically? Solr has always special-cased this case and avoided sorting altogether Great, good to know! Cheers, Martin
ArrayIndexOutOfBoundsException for query with rows=0 and sort param
Hi, after an upgrade from solr-1.3 to 1.4.1 we're getting an ArrayIndexOutOfBoundsException for a query with rows=0 and a sort param specified:

java.lang.ArrayIndexOutOfBoundsException: 0
        at org.apache.lucene.search.FieldComparator$StringOrdValComparator.copy(FieldComparator.java:660)
        at org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector.collect(TopFieldCollector.java:84)
        at org.apache.solr.search.SolrIndexSearcher.sortDocSet(SolrIndexSearcher.java:1391)
        at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:872)
        at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
        at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)

The query is e.g.: /select/?sort=popularity+desc&rows=0&start=0&q=foo When this is changed to rows=1 or when the sort param is removed the exception is gone and everything's fine. With a clean 1.4.1 installation (unzipped, started example and posted two documents as described in the tutorial) this issue is not reproducible. Does anyone have a clue what might be the reason for this and how we could fix this on the solr side? Of course - for a quick fix - I'll change our app so that there's no sort param specified when rows=0. Thanx cheers, Martin -- Martin Grotzke http://twitter.com/martin_grotzke
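The quick fix mentioned at the end can be sketched on the client side like this (SolrJ; the field name and query are just examples):

import org.apache.solr.client.solrj.SolrQuery;

public class QueryBuilderSketch {
    public static SolrQuery buildQuery(String q, int rows) {
        SolrQuery query = new SolrQuery(q);
        query.setRows(rows);
        // Only ask for a sort when rows are actually requested; with rows=0
        // the sort order cannot influence the (empty) result list anyway.
        if (rows > 0) {
            query.addSortField("popularity", SolrQuery.ORDER.desc);
        }
        return query;
    }
}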
Re: How to tokenize/analyze docs for the spellchecker - at indexing and query time
Thanx for your help so far, I just wanted to post my results here... In short: Now I use the ShingleFilter to create shingles when copying my fields into my field spellMultiWords. For query time, I implemented a MultiWordSpellingQueryConverter that just leaves the query as is, so that there's only one token that is checked for spelling suggestions. Here's the detailed configuration:

= schema.xml =

<fieldType name="textSpellMultiWords" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<field name="spellMultiWords" type="textSpellMultiWords" indexed="true" stored="true" multiValued="true"/>

<copyField source="name" dest="spellMultiWords" />
<copyField source="cat" dest="spellMultiWords" />
... and more ...

= solrconfig.xml =

<searchComponent name="spellcheckMultiWords" class="solr.SpellCheckComponent">
  <!-- this is not used at all, can probably be omitted -->
  <str name="queryAnalyzerFieldType">textSpellMultiWords</str>
  <lst name="spellchecker">
    <!-- Optional, it is required when more than one spellchecker is configured -->
    <str name="name">default</str>
    <str name="field">spellMultiWords</str>
    <str name="spellcheckIndexDir">./spellcheckerMultiWords1</str>
    <str name="accuracy">0.5</str>
    <str name="buildOnCommit">true</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">jarowinkler</str>
    <str name="field">spellMultiWords</str>
    <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
    <str name="spellcheckIndexDir">./spellcheckerMultiWords2</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<queryConverter name="queryConverter" class="my.proj.solr.MultiWordSpellingQueryConverter"/>

= MultiWordSpellingQueryConverter =

public class MultiWordSpellingQueryConverter extends QueryConverter {

  /**
   * Converts the original query string to a collection of Lucene Tokens.
   *
   * @param original the original query string
   * @return a Collection of Lucene Tokens
   */
  public Collection<Token> convert( String original ) {
    if ( original == null ) {
      return Collections.emptyList();
    }
    final Token token = new Token( 0, original.length() );
    token.setTermBuffer( original );
    return Arrays.asList( token );
  }
}

There are some issues still to be resolved:
- terms are lowercased in the index, there should happen some case restoration
- we use stemming for our text field, so the spellchecker might suggest searches that lead to equal search results (e.g. the german2 stemmer stems both hose and hosen to hos - Hose and Hosen give the same results)
- inconsistent/strange sorting of suggestions (as described in http://www.nabble.com/spellcheck%3A-issues-td19845539.html).

Cheers, Martin

On Mon, 2008-10-06 at 22:45 +0200, Martin Grotzke wrote: On Mon, 2008-10-06 at 09:00 -0400, Grant Ingersoll wrote: On Oct 6, 2008, at 3:51 AM, Martin Grotzke wrote: Hi Jason, what about multi-word searches like harry potter? When I do a search in our index for harry poter, I get the suggestion harry spotter (using spellcheck.collate=true and jarowinkler distance). Searching for harry spotter (we're searching AND, not OR) then gives no results. I asume that this is because suggestions are done for words separately, and this does not require that both/all suggestions are contained in the same document. Yeah, the SpellCheckComponent is not phrase aware.
My guess would be that you would somehow need a QueryConverter (see http://wiki.apache.org/solr/SpellCheckComponent) that preserved phrases as a single token. Likewise, you would need that on your indexing side as well for the spell checker. In short, I suppose it's possible, but it would be work. You probably could use the shingle filter (token based n-grams). I also thought about s.th. like this, and also stumbled over the ShingleFilter :) So I would change the spell field to use the ShingleFilter? Did I understand the answer to the posting chaining copyFields correctly, that I cannot pipe the title through some shingledTitle field and copy it afterwards to the spell field (while other fields like brand are copied directly to the spell field)? Thanx cheers, Martin Alternatively, by using extendedResults, you can get back the frequency of each of the words, and then you could decide whether the collation is going to have any results assuming they are all or'd together. For phrases and AND queries, I'm not sure. It's doable, I'm sure, but it would be a lot more involved. I wonder what's the standard approach for searches with multiple words
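With the configuration shown above, a spellcheck request could look roughly like this (assuming the spellcheckMultiWords component has also been added to a request handler via last-components, which is not shown in the configuration):

/select?q=harry poter&spellcheck=true&spellcheck.dictionary=jarowinkler&spellcheck.count=5&spellcheck.collate=true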
Re: How to tokenize/analyze docs for the spellchecker - at indexing and query time
Hi Jason, what about multi-word searches like harry potter? When I do a search in our index for harry poter, I get the suggestion harry spotter (using spellcheck.collate=true and jarowinkler distance). Searching for harry spotter (we're searching AND, not OR) then gives no results. I asume that this is because suggestions are done for words separately, and this does not require that both/all suggestions are contained in the same document. I wonder what's the standard approach for searches with multiple words. Are these working ok for you? Cheers, Martin On Fri, 2008-10-03 at 16:21 -0400, Jason Rennie wrote: Hi Martin, I'm a relative newbie to solr, have been playing with the spellcheck component and seem to have it working. I certainly can't explain what all is going on, but with any luck, I can help you get the spellchecker up-and-running. Additional replies in-lined below. On Wed, Oct 1, 2008 at 7:11 AM, Martin Grotzke [EMAIL PROTECTED] wrote: Now I'm thinking about the source-field in the spellchecker (spell): how should fields be analyzed during indexing, and how should the queryAnalyzerFieldType be configured. I followed the conventions in the default solrconfig.xml and schema.xml files. So I created a textSpell field type (schema.xml): !-- field type for the spell checker which doesn't stem -- fieldtype name=textSpell class=solr.TextField positionIncrementGap=100 analyzer tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer /fieldtype and used this for the queryAnalyzerFieldType. I also created a spellField to store the text I want to spell check against and used the same analyzer (figuring that the query and indexed data should be analyzed the same way) (schema.xml): !-- Spell check field -- field name=spellField type=textSpell indexed=true stored=true / If I have brands like e.g. Apple or Ed Hardy I would copy them (the field brand) directly to the spell field. The spell field is of type string. We're copying description to spellField. I'd recommend using a type like the above textSpell type since The StringField type is not analyzed, but indexed/stored verbatim (schema.xml): copyField source=description dest=spellField / Other fields like e.g. the product title I would first copy to some whitespaceTokinized field (field type with WhitespaceTokenizerFactory) and afterwards to the spell field. The product title might be e.g. Canon EOS 450D EF-S 18-55 mm. Hmm... I'm not sure if this would work as I don't think the analyzer is applied until after the copy is made. FWIW, I've had trouble copying multipe fields to spellField (i.e. adding a second copyField w/ dest=spellField), so we just index the spellchecker on a single field... Shouldn't this be a WhitespaceTokenizerFactory, or is it better to use a StandardTokenizerFactory here? I think if you use the same analyzer for indexing and queries, the distinction probably isn't tremendously important. When I went searching, it looked like the StandardTokenizer split on non-letters. I'd guess the rationale for using the StandardTokenizer is that it won't recommend non-letter characters. I was seeing some weirdness earlier (no inserts/deletes), but that disappeared now that I'm using the StandardTokenizer. Cheers, Jason -- Martin Grotzke http://www.javakaffee.de/blog/ signature.asc Description: This is a digitally signed message part
Re: How to tokenize/analyze docs for the spellchecker - at indexing and query time
On Mon, 2008-10-06 at 09:00 -0400, Grant Ingersoll wrote: On Oct 6, 2008, at 3:51 AM, Martin Grotzke wrote: Hi Jason, what about multi-word searches like harry potter? When I do a search in our index for harry poter, I get the suggestion harry spotter (using spellcheck.collate=true and jarowinkler distance). Searching for harry spotter (we're searching AND, not OR) then gives no results. I asume that this is because suggestions are done for words separately, and this does not require that both/all suggestions are contained in the same document. Yeah, the SpellCheckComponent is not phrase aware. My guess would be that you would somehow need a QueryConverter (see http://wiki.apache.org/solr/SpellCheckComponent) that preserved phrases as a single token. Likewise, you would need that on your indexing side as well for the spell checker. In short, I suppose it's possible, but it would be work. You probably could use the shingle filter (token based n-grams). I also thought about s.th. like this, and also stumbled over the ShingleFilter :) So I would change the spell field to use the ShingleFilter? Did I understand the answer to the posting chaining copyFields correctly, that I cannot pipe the title through some shingledTitle field and copy it afterwards to the spell field (while other fields like brand are copied directly to the spell field)? Thanx cheers, Martin Alternatively, by using extendedResults, you can get back the frequency of each of the words, and then you could decide whether the collation is going to have any results assuming they are all or'd together. For phrases and AND queries, I'm not sure. It's doable, I'm sure, but it would be a lot more involved. I wonder what's the standard approach for searches with multiple words. Are these working ok for you? Cheers, Martin On Fri, 2008-10-03 at 16:21 -0400, Jason Rennie wrote: Hi Martin, I'm a relative newbie to solr, have been playing with the spellcheck component and seem to have it working. I certainly can't explain what all is going on, but with any luck, I can help you get the spellchecker up-and-running. Additional replies in-lined below. On Wed, Oct 1, 2008 at 7:11 AM, Martin Grotzke [EMAIL PROTECTED] wrote: Now I'm thinking about the source-field in the spellchecker (spell): how should fields be analyzed during indexing, and how should the queryAnalyzerFieldType be configured. I followed the conventions in the default solrconfig.xml and schema.xml files. So I created a textSpell field type (schema.xml): !-- field type for the spell checker which doesn't stem -- fieldtype name=textSpell class=solr.TextField positionIncrementGap=100 analyzer tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer /fieldtype and used this for the queryAnalyzerFieldType. I also created a spellField to store the text I want to spell check against and used the same analyzer (figuring that the query and indexed data should be analyzed the same way) (schema.xml): !-- Spell check field -- field name=spellField type=textSpell indexed=true stored=true / If I have brands like e.g. Apple or Ed Hardy I would copy them (the field brand) directly to the spell field. The spell field is of type string. We're copying description to spellField. I'd recommend using a type like the above textSpell type since The StringField type is not analyzed, but indexed/stored verbatim (schema.xml): copyField source=description dest=spellField / Other fields like e.g. 
the product title I would first copy to some whitespaceTokinized field (field type with WhitespaceTokenizerFactory) and afterwards to the spell field. The product title might be e.g. Canon EOS 450D EF-S 18-55 mm. Hmm... I'm not sure if this would work as I don't think the analyzer is applied until after the copy is made. FWIW, I've had trouble copying multipe fields to spellField (i.e. adding a second copyField w/ dest=spellField), so we just index the spellchecker on a single field... Shouldn't this be a WhitespaceTokenizerFactory, or is it better to use a StandardTokenizerFactory here? I think if you use the same analyzer for indexing and queries, the distinction probably isn't tremendously important. When I went searching, it looked like the StandardTokenizer split on non-letters. I'd guess the rationale for using the StandardTokenizer is that it won't recommend non-letter characters. I was seeing some weirdness earlier (no inserts/deletes), but that disappeared now that I'm using the StandardTokenizer. Cheers, Jason -- Martin Grotzke http://www.javakaffee.de
How to tokenize/analyze docs for the spellchecker - at indexing and query time
Hi, I'm just starting with the spellchecker component provided by solr - it is really cool! Now I'm thinking about the source-field in the spellchecker (spell): how should fields be analyzed during indexing, and how should the queryAnalyzerFieldType be configured. If I have brands like e.g. Apple or Ed Hardy I would copy them (the field brand) directly to the spell field. The spell field is of type string. Other fields like e.g. the product title I would first copy to some whitespace-tokenized field (field type with WhitespaceTokenizerFactory) and afterwards to the spell field. The product title might be e.g. Canon EOS 450D EF-S 18-55 mm. This is the process I have in mind during indexing (though I'm not sure if some tokens/terms should be removed, but I'd assume that all terms might be misspelled by the user). Now when it comes to searching, the query should be analyzed using the queryAnalyzerFieldType definition, which has a StandardTokenizerFactory in the schema.xml of the solr example. Shouldn't this be a WhitespaceTokenizerFactory, or is it better to use a StandardTokenizerFactory here? Or should I use a StandardTokenizerFactory for the spell field, so that fields copied into this field get tokenized/analyzed in the same way as the query will get tokenized/analyzed? Do you have any experience with this and/or recommendations regarding this? Are there other things to consider? Thanx for your help, cheers, Martin
Re: prefix-search ignores the lowerCaseFilter
On Thu, 2007-10-25 at 10:48 -0400, Yonik Seeley wrote: On 10/25/07, Max Scheffler [EMAIL PROTECTED] wrote: Is it possible that the prefix-processing ignores the filters? Yes, It's a known limitation that we haven't worked out a fix for yet. The issue is that you can't just run the prefix through the filters because of things like stop words, stemming, minimum length filters, etc. What about not having only facet.prefix but additionally facet.filtered.prefix that runs the prefix through the filters? Would that be possible? Cheers, Martin -Yonik
Re: prefix-search ignores the lowerCaseFilter
On Mon, 2007-10-29 at 13:31 -0400, Yonik Seeley wrote: On 10/29/07, Martin Grotzke [EMAIL PROTECTED] wrote: On Thu, 2007-10-25 at 10:48 -0400, Yonik Seeley wrote: On 10/25/07, Max Scheffler [EMAIL PROTECTED] wrote: Is it possible that the prefix-processing ignores the filters? Yes, It's a known limitation that we haven't worked out a fix for yet. The issue is that you can't just run the prefix through the filters because of things like stop words, stemming, minimum length filters, etc. What about not having only facet.prefix but additionally facet.filtered.prefix that runs the prefix through the filters? Would that be possible? The underlying issue remains - it's not safe to treat the prefix like any other word when running it through the filters. Yes, definitely the user that uses this feature should know what it does - but at least there would be the possibility to run the prefix through e.g. a LowerCaseFilter. Finally the user knows what filters he has configured. E.g. if you only want an ignore-case prefix test, s.th. like a facet.filtered.prefix would be really valuable. Cheers, Martin -Yonik
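Until something like facet.filtered.prefix exists, a common workaround is to normalize the prefix on the client side exactly as the field's index analyzer would - for the ignore-case scenario described above that simply means lowercasing the user input before sending it, e.g. (field name and prefix are just examples, the lowercasing happens in the application before the request is built):

/select?q=*:*&rows=0&facet=true&facet.field=text&facet.prefix=fo&facet.limit=10&facet.mincount=1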
type ahead - suggest words with facet.prefix, but with original case (or another solution?)
Hello, I'm just thinking about a solution for a type-ahead functionality that shall suggest terms that the user can search for, and that displays how many docs are behind that search (like google suggest). When I use facet.prefix and facet.field=text, where text is my catchall field (and default field for searching), then only lowercased words are suggested, not original ones. And I want to have it independent of the user's input - it should not matter if the user enters fo or Fo, I always want to have Foo suggested if this word exists in my docs. Is that possible? AFAICS the limitation of this approach is that it is limited to single words. E.g. when the user enters foo ba, then he would not get Foo Bar as a suggestion (assuming that my catchall field contains tokenized terms). What do you think of this: Assuming I have my own RequestHandler, I would split the user's input to get the last word, and use everything but this last word as query, to limit the resulting docs (my default operator is AND). Afterwards I search for terms starting with the last word and do standard faceting stuff (calculate the number of docs for each term). Are there other/better approaches/solutions for type-ahead functionality that you would recommend? Btw: my docs contain products with the main fields name, cat, type, tags, brand, color - these are used for searching (copied into the text field). Thanx in advance, cheers, Martin signature.asc Description: This is a digitally signed message part
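A rough sketch of that idea with the SolrJ client (treat it as an assumption sketch: class names such as SolrClient changed across SolrJ versions, the catchall field "text" comes from the mail above, and lowercasing the prefix assumes the field is lowercased at index time):

import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TypeAhead {

    // Suggest completions for the last word the user typed, restricted
    // to documents matching the words typed before it.
    static List<FacetField.Count> suggest(SolrClient solr, String userInput) throws Exception {
        int lastSpace = userInput.lastIndexOf(' ');
        String context = lastSpace > 0 ? userInput.substring(0, lastSpace) : "*:*";
        String prefix = userInput.substring(lastSpace + 1).toLowerCase();

        SolrQuery query = new SolrQuery(context);
        query.setRows(0);                    // only the facet counts are needed
        query.setFacet(true);
        query.addFacetField("text");         // catchall field (assumption)
        query.setFacetPrefix("text", prefix);
        query.setFacetMinCount(1);
        query.setFacetLimit(10);

        QueryResponse rsp = solr.query(query);
        return rsp.getFacetField("text").getValues();  // term + doc count pairs
    }
}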
Re: Different search results for (german) singular/plural searches - looking for a solution
Hi, now I played around with the snowball porter stemmer and it definitely feels really good (used German2 as suggested). For some cases (e.g. product types like top/tops, bermuda/bermudas or hoody/hoodies) we additionally need synonyms. At first I thought it would be good to use synonyms only at query time, but the docs in the wiki recommend expanding synonyms at index time... What are your experiences? Would you also suggest using them when indexing? On Thu, 2007-10-11 at 17:32 +0200, Thomas Traeger wrote: Martin Grotzke schrieb: Try the SnowballPorterFilterFactory with German2 as language attribute first and use synonyms for combined words i.e. Herrenhose = Herren, Hose. so you use a combined approach? Yes, we define the relevant parts of compounded words (keywords only) as synonyms and feed them into a special field that is used for searching and for the product index. So you don't use a single catchall field text? What is the reason for this, what is the advantage? I hope there will be a filter that can split compounded words sometime in the future... There is no standard approach for handling this problem apart from synonyms? This is exactly what jwordsplitter does (as posted by Daniel)... Thanx cheers, Martin By using stemming you will maybe have some interesting results, but it is much better living with them than having no or far fewer results ;o) Do you have an example of what interesting results I can expect, just to get an idea? Find more info on the Snowball stemming algorithms here: http://snowball.tartarus.org/ Thanx! I also had a look at this site already, but what is missing is a demo where one can see what's happening. I think I'll play a little with stemming to get a feeling for this. I think the Snowball stemmer is very good so I have no practical example for you. Maybe this is of value to see what happens: http://snowball.tartarus.org/algorithms/german/diffs.txt If you have mixed languages in your content, which sometimes happens in product data, you might get into some trouble. Also have a look at the StopFilterFactory, here is a sample stopword list for the German language: http://snowball.tartarus.org/algorithms/german/stop.txt Our application handles products, do you think such stopwords are useful in this scenario also? I wouldn't expect a user to search for keine hose or s.th. like this :) I have seen much worse queries, so you never know ;o) think of a query like this: Hose in blau für Herren You will definitely want to remove in and für during searching, and it reduces index size when removed during indexing. Maybe you will even get better scores when only relevant terms are used. You should optimize the stopword list based on your data. Regards, Tom -- Martin Grotzke http://www.javakaffee.de/blog/ signature.asc Description: This is a digitally signed message part
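To get a quick feeling for what German2 does to singular/plural pairs without setting up a schema, the Snowball stemmer classes bundled with Lucene can be called directly. A small sketch (package/class names as in the Snowball code shipped with Lucene; verify against your version):

import org.tartarus.snowball.SnowballProgram;
import org.tartarus.snowball.ext.German2Stemmer;

public class GermanStemDemo {
    public static void main(String[] args) {
        SnowballProgram stemmer = new German2Stemmer();
        // Singular and plural forms will often collapse to the same stem,
        // e.g. hose and hosen -> hos; results for umlauts depend on the variant.
        for (String word : new String[] { "hose", "hosen", "hut", "hüte", "t-shirt", "t-shirts" }) {
            stemmer.setCurrent(word);
            stemmer.stem();
            System.out.println(word + " -> " + stemmer.getCurrent());
        }
    }
}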
Re: Different search results for (german) singular/plural searches - looking for a solution
Hi Daniel, thanx for your suggestions, being able to export a large synonyms.txt sounds very good! Thx cheers, Martin On Wed, 2007-10-10 at 23:38 +0200, Daniel Naber wrote: On Wednesday 10 October 2007 12:00, Martin Grotzke wrote: Basically I see two options: stemming and the usage of synonyms. Are there others? A large list of German words and their forms is available from a Windows software called Morphy (http://www.wolfganglezius.de/doku.php?id=public:cl:morphy). You can use it for mapping full forms to base forms (Häuser - Haus). You can also have a look at www.languagetool.org which uses this data in a Java application. LanguageTool also comes with jWordSplitter, which can find a compound's parts (Autowäsche - Auto + Wäsche). Regards Daniel signature.asc Description: This is a digitally signed message part
Different search results for (german) singular/plural searches - looking for a solution
Hello, with our application we have the issue that we get different results for singular and plural searches (German language). E.g. for hose we get 1.000 documents back, but for hosen we get 10.000 docs. The same applies to t-shirt or t-shirts, or e.g. hut and hüte - lots of cases :) This is absolutely correct according to the schema.xml, as right now we do not have any stemming or synonyms included. Now we want to have similar search results for these singular/plural searches. I'm thinking of a solution for this, and want to ask what your experiences with this are. Basically I see two options: stemming and the usage of synonyms. Are there others? My concern with stemming is that it might produce unexpected results, so that docs are found that do not match the query from the user's point of view. I assume that this needs a lot of testing with different data. The issue with synonyms is that we would have to create a file containing all synonyms, so we would have to figure out all cases, in contrast to a solution that is based on an algorithm. The advantage of this approach is, IMHO, that it is very predictable which results will be returned for a certain query. Some background information: Our documents contain products (id, name, brand, category, producttype, description, color etc). The singular/plural issue basically applies to the fields name, category and producttype, so we would like to restrict the solution to these fields. Do you have suggestions on how to handle this? Thanx in advance for sharing your experiences, cheers, Martin

Extracts of our schema.xml:

<types>
  <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldtype>
  <fieldType name="trimmedString" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.TrimFilterFactory"/>
    </analyzer>
    <!-- we should also configure lcasing for index and query analyzer -->
  </fieldType>
</types>
<fields>
  <field name="name" type="text" indexed="true" stored="true"/>
  <field name="cat" type="trimmedString" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
  <field name="type" type="trimmedString" indexed="true" stored="true" multiValued="false" omitNorms="true"/>
</fields>
<defaultSearchField>text</defaultSearchField>
<copyField source="tag" dest="text"/>
<copyField source="cat" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="type" dest="text"/>
<copyField source="brand" dest="text"/>

signature.asc Description: This is a digitally signed message part
Re: How to extract constrained fields from query
On Thu, 2007-08-23 at 10:44 -0700, Chris Hostetter wrote: : Probably I'm also interested in PrefixQueries, as they also provide a : Term, e.g. parsing ipod AND brand:apple gives a PrefixQuery for : brand:apple. uh? ... it shouldn't, not unless we're talking about some other customization you've already made. My fault, this is returned for s.th. like brand:appl* - but perhaps I would also like to facet on such fields then... : I want to do s.th. like dynamic faceting - so that the solr client : does not have to request facets via facet.field, but that I can decide : in my CustomRequestHandler which facets are returned. But I want to : return only facets for fields that are not already constrained, e.g. : when the query contains s.th. like brand:apple I don't want to return : a facet for the field brand. Hmmm, i see ... well the easiest way to go is not to worry about it when parsing the query, when you go to compute facets for all the fields you think might be useful, you'll see that only one value for brand matches, and you can just skip it. I would think that this is not the best option in terms of performance. that doesn't really work well for range queries -- but you can't exactly use the same logic for picking what your facet constraints will be on a field that makes sense to do a range query on anyway, so it's tricky either way. the custom QueryParser is still probably your best bet... : Ok, so I would override getFieldQuery, getPrefixQuery, getRangeQuery and : getWildcardQuery(?) and record the field names? And I would use this : QueryParser for both parsing of the query (q) and the filter queries : (fq)? yep. Alright, then I'll choose this door. (Note there is also an extractTerms method on Query that can help in some cases, but the impl for ConstantScoreQuery (which is used when the SolrQueryParser sees a range query or a prefix query) doesn't really work at the moment.) Yep, I had already tried this, but it always failed with an UnsupportedOperationException... Thanx a lot, cheers, Martin -Hoss -- Martin Grotzke http://www.javakaffee.de/blog/ signature.asc Description: This is a digitally signed message part
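A sketch of such a field-recording QueryParser (the exact override signatures differ between Lucene versions, so treat them as assumptions; the idea is only to note every field the parser sees, delegate to super, and run both q and the fq strings through it):

import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

// Records which fields are constrained while parsing q / fq strings.
public class FieldRecordingQueryParser extends QueryParser {

    private final Set<String> constrainedFields = new HashSet<String>();

    public FieldRecordingQueryParser(String defaultField, Analyzer analyzer) {
        super(defaultField, analyzer);
    }

    @Override
    protected Query getFieldQuery(String field, String queryText, boolean quoted) throws ParseException {
        constrainedFields.add(field);
        return super.getFieldQuery(field, queryText, quoted);
    }

    @Override
    protected Query getPrefixQuery(String field, String termStr) throws ParseException {
        constrainedFields.add(field);
        return super.getPrefixQuery(field, termStr);
    }

    @Override
    protected Query getWildcardQuery(String field, String termStr) throws ParseException {
        constrainedFields.add(field);
        return super.getWildcardQuery(field, termStr);
    }

    @Override
    protected Query getRangeQuery(String field, String part1, String part2,
                                  boolean startInclusive, boolean endInclusive) throws ParseException {
        constrainedFields.add(field);
        return super.getRangeQuery(field, part1, part2, startInclusive, endInclusive);
    }

    public Set<String> getConstrainedFields() {
        return constrainedFields;
    }
}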
How to extract constrained fields from query
Hello, in my custom request handler, I want to determine which fields are constrained by the user. E.g. the query (q) might be ipod AND brand:apple and there might be a filter query (fq) like color:white (or more). What I want to know is that brand and color are constrained. AFAICS I could use SolrPluginUtils.parseFilterQueries and test whether the queries are TermQueries and read their fields. Then should I also test which kinds of queries I get when parsing the query (q) and look for all TermQueries in the parsed query? Or is there a more elegant way of doing this? Thanx a lot, cheers, Martin signature.asc Description: This is a digitally signed message part
RE: How to read values of a field efficiently
On Tue, 2007-08-21 at 11:52 +0200, Ard Schrijvers wrote: you're missing the key piece that Ard alluded to ... there is one ordered list of all terms stored in the index ... a TermEnum lets you iterate over this ordered list, and the IndexReader.terms(Term) method lets you efficiently start at an arbitrary term. if you are only interested in terms for a specific field, once your TermEnum returns a different field, you can stop -- you will never get any more terms for the field you care about (hence Ard's terms.term().field() == field in his loop conditional) Ok, I wasn't aware of that - I thought that Ard's while loop would be wrong, I am deeply hurt by your distrust. :-) Shame on me :-$ Ard signature.asc Description: This is a digitally signed message part
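For reference, the loop being described looks roughly like this against the pre-4.0 Lucene API (TermEnum was removed in later versions, so this is only a sketch for that era):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class FieldTermDump {

    // Iterate all indexed terms of one field, in their natural (sorted) order.
    static void dumpTerms(IndexReader reader, String field) throws IOException {
        TermEnum terms = reader.terms(new Term(field, ""));  // seek to the first term of the field
        try {
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(field)) {
                    break;  // left the field -> no more terms for it
                }
                System.out.println(t.text() + " (docFreq=" + terms.docFreq() + ")");
            } while (terms.next());
        } finally {
            terms.close();
        }
    }
}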
Re: How to read values of a field efficiently
On Mon, 2007-07-30 at 00:30 -0700, Chris Hostetter wrote: : Is it possible to get the values from the ValueSource (or from : getFieldCacheCounts) sorted by its natural order (from lowest to : highest values)? well, an inverted term index is already a data structure listing terms from lowest to highest and the associated documents -- so if you want to iterate from low to high between a range and find matching docs you should just use the TermEnum -- the whole point of the FieldCache (and FieldCacheSource) is to have a reverse inverted index so you can quickly fetch the indexed value if you know the docId. Ok, I will have a look at the TermEnum and try this. perhaps you should elaborate a little more on what it is you are trying to do so we can help you figure out how to do it more efficiently ... I want to read all values of the price field of the found docs, and calculate the mean value and the standard deviation. Based on the min value (mean - deviation), the max value (mean + deviation) and the number of prices I calculate price ranges. Then I iterate over the sorted array of prices and count how many prices go into the current range. This sorting (Arrays.sort) takes a lot of time, that's why I asked if it's possible to read values in sorted order. But reading this, I think it would also be possible to skip sorting and check for each price into which bucket it would go and increment the counter for this bucket - this should also be a possibility for optimization. ... perhaps you shouldn't be iterating over every doc to figure out your ranges .. perhaps you can iterate over the terms themselves? Are you referring to TermEnum with this? Thanx cheers, Martin hang on ... rereading your first message i just noticed something i definitely didn't spot before... Fairly long: getFieldCacheCounts for the cat field takes ~70 ms for the second request, while reading prices takes ~600 ms. ...i clearly missed this, and fixated on your assertion that your reading of field values took longer than the stock methods -- but you're not just comparing the time needed by different methods, you're also timing different fields. this actually makes a lot of sense since there are probably a lot fewer unique values for the cat field, so there are a lot fewer discrete values to deal with when computing counts. -Hoss -- Martin Grotzke http://www.javakaffee.de/blog/ signature.asc Description: This is a digitally signed message part
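Both ideas from this thread - computing mean/standard deviation in a single pass and counting bucket membership instead of sorting - can be combined; a plain-Java sketch independent of any Solr/Lucene API (assumes a non-empty prices array):

import java.util.Arrays;

public class PriceRanges {

    // One pass for sum and sum of squares -> mean and standard deviation,
    // then a second pass that drops each price into its bucket without sorting.
    static int[] bucketCounts(int[] prices, int numBuckets) {
        double sum = 0, sumSq = 0;
        for (int p : prices) {
            sum += p;
            sumSq += (double) p * p;
        }
        double mean = sum / prices.length;
        double stdDev = Math.sqrt(sumSq / prices.length - mean * mean);

        double min = Math.max(0, mean - stdDev);
        double max = mean + stdDev;
        double width = (max - min) / numBuckets;

        int[] counts = new int[numBuckets];
        if (width == 0) {               // all prices identical
            counts[0] = prices.length;
            return counts;
        }
        for (int p : prices) {
            int bucket = (int) ((p - min) / width);
            if (bucket < 0) bucket = 0;                        // below mean - stddev
            if (bucket >= numBuckets) bucket = numBuckets - 1; // above mean + stddev
            counts[bucket]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] prices = { 5, 10, 12, 20, 25, 40, 80, 90, 120, 500 };
        System.out.println(Arrays.toString(bucketCounts(prices, 4)));
    }
}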
How to read values of a field efficiently
Hi, I have a custom Facet implementation that extends SimpleFacets and overrides getTermCounts( String field ). For the price field I calculate available ranges; for this I have to read the values of this field. Right now it looks like this:

public NamedList getTermCounts( final String field ) throws IOException {
    SchemaField sf = searcher.getSchema().getField( field );
    FieldType ft = sf.getType();
    final DocValues docValues = ft.getValueSource( sf ).getValues( searcher.getReader() );
    final DocIterator iter = docs.iterator();
    final TIntArrayList prices = new TIntArrayList( docs.size() );
    while (iter.hasNext()) {
        float value = docValues.floatVal(iter.next());
        prices.add( (int)value );
    }
    // calculate ranges and return the result
}

This part (reading field values) takes fairly long compared to the other fields (that use getFacetTermEnumCounts or getFieldCacheCounts as implemented in SimpleFacets), so I assume that there is potential for optimization. Fairly long: getFieldCacheCounts for the cat field takes ~70 ms for the second request, while reading prices takes ~600 ms. Is there a better way (in terms of performance) to determine the values for the found docs? Thanx in advance, cheers, Martin signature.asc Description: This is a digitally signed message part
Indexing question - split word and comma
Hi all, I have a document with a name field like this:

<field name='name'>MP3-Player, Apple, &#xBB;iPod nano&#xAB;, silber, 4GB</field>

and want to find apple. Unfortunately, I only find apple,... Can anybody help me with this? The schema.xml contains the following field definition

<field name="name" type="text" indexed="true" stored="true"/>

and this fieldType definition for type text:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

The default search field is text:

<defaultSearchField>text</defaultSearchField>

with the following definition:

<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

and the copy from name to text...

<copyField source="name" dest="text"/>

Thanx in advance, cheers, Martin signature.asc Description: This is a digitally signed message part
Re: Indexing question - split word and comma
On Thu, 2007-07-05 at 11:56 -0700, Mike Klaas wrote: On 5-Jul-07, at 11:43 AM, Martin Grotzke wrote: Hi all, I have a document with a name field like this: <field name='name'>MP3-Player, Apple, &#xBB;iPod nano&#xAB;, silber, 4GB</field> and want to find apple. Unfortunately, I only find apple,... Can anybody help me with this? Sure: you're using WhitespaceAnalyzer, which only splits on whitespace. If you want to split words from punctuation, you should use something like StandardAnalyzer or WordDelimiterFilter. I replaced <tokenizer class="solr.WhitespaceTokenizerFactory"/> by <tokenizer class="solr.StandardTokenizerFactory"/> in the indexer part of the fieldtype definition, and now I find apple and ipod, really great! It is also extremely helpful to look at the analysis page on the solr admin (verbose=true) and see exactly what tokens your analyzer produces. This is such a cool tool, I didn't know it! It's really great that you see each step of the filters so that it's possible to understand better what's going on during indexing, really, really cool!! Thanx a lot, cheers, Martin -Mike signature.asc Description: This is a digitally signed message part
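If you want to see the same difference outside of the admin analysis page, the two tokenizers can be compared in a few lines of Lucene code (a sketch; analyzer packages and constructors vary between Lucene versions):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerCompare {

    static void print(String label, Analyzer analyzer, String text) throws IOException {
        TokenStream ts = analyzer.tokenStream("name", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        StringBuilder out = new StringBuilder(label + ": ");
        while (ts.incrementToken()) {
            out.append('[').append(term.toString()).append("] ");
        }
        ts.end();
        ts.close();
        System.out.println(out);
    }

    public static void main(String[] args) throws IOException {
        String name = "MP3-Player, Apple, iPod nano, silber, 4GB";
        // Whitespace keeps the trailing commas ("Apple,"); Standard splits on punctuation and lowercases.
        print("whitespace", new WhitespaceAnalyzer(), name);
        print("standard", new StandardAnalyzer(), name);
    }
}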
Re: Same record belonging to multiple facets
On Thu, 2007-07-05 at 12:39 -0700, Thiago Jackiw wrote: Is there a way for a record to belong to multiple facets? If so, how would one go about implementing it? What I'd like to accomplish would be something like: record A: name=John Doe category_facet=Cars category_facet=Electronics Isn't this the multiValued=true property in your field definition for category_facet? Cheers, Martin And when searching for John Doe his record would appear under both Cars and Electronics facet categories. Thanks. -- Thiago Jackiw signature.asc Description: This is a digitally signed message part
Re: Dynamically calculated range facet
Chris, thanx for all this info! I'll think about these things again and then come back to you... Cheers, Martin On Tue, 2007-06-26 at 23:22 -0700, Chris Hostetter wrote: : my documents (products) have a price field, and I want to have : a dynamically calculated range facet for that in the response. FYI: there have been some previous discussions on this topic... http://www.nabble.com/blahblah-t2387813.html#a6799060 http://www.nabble.com/faceted-browsing-t1363854.html#a3753053 : AFAICS I do not have the possibility to specify range queries in my : application, as I do not have a clue what's the lowest and highest : price in the search result and what are good ranges according : to the (statistical) distribution of prices in the search result. as mentioned in one of those threads, it's *really* hard to get the statistical sampling to the point where it's both balanced, but also user friendly. writing code specifically for price ranges in dollars lets you make some assumptions about things that give you nice ranges (rounding to one significant digit less than the max, doing log-based ranges, etc..) that wouldn't really apply if you were trying to implement a truly generic dynamic range generator. one thing to keep in mind: it's typically not a good idea to have the constraint set of a facet change just because some other constraint was added to the query -- individual constraints might disappear because they no longer apply, but it can be very disconcerting to a user when options change on them. if i search on ipod a statistical analysis of prices might yield facet ranges of $1-20, $20-60, $60-120, $120-$200 ... if i then click on accessories the statistics might skew cheaper, so the new ranges are $1-20, $20-30, $30-40, $40-70 ... and now i'm a frustrated user, because i really wanted to use the range $20-60 (that just happens to be my budget) and you offered it to me and then you took it away ... i have to undo my selection of accessories, then click $20-60, and then click accessories to get what i want ... not very nice. : So if it would be possible to go over each item in the search result : I could check the price field and define my ranges for the specific : query on solr side and return the price ranges as a facet. : Otherwise, what would be a good starting point to plug in such : functionality into solr? if you really want to do statistical distributions, one way to avoid doing all of this work on the client side (and needing to pull back all of the prices from all of the matches) would be to write a custom request handler that subclasses whichever one you currently use and does this computation on the server side -- where it has lower level access to the data and doesn't need to stream it over the wire. FieldCache in particular would come in handy. it occurs to me that even though there may not be a way to dynamically create facet ranges that can apply usefully to any numeric field, we could add generic support to the request handlers for optionally fetching some basic statistics about a DocSet for clients that want them (either for building ranges, or for any other purpose): min, max, mean, median, mode, midrange ... those should all be easy to compute using the ValueSource from the field type (it would be nice if FieldTypes had some way of indicating which DocValues function can best manage the field type, but we can always assume float or have an option for dictating it ...
people might want a float mean for an int field anyway) i suppose even stddev could be computed fairly easily ... there's a formula for that that works well in a single pass over a bunch of values right? -Hoss -- Martin Grotzke http://www.javakaffee.de/blog/ signature.asc Description: This is a digitally signed message part
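For the record, the single-pass formula alluded to here is the usual shortcut based on a running sum and sum of squares: \sigma^2 = \frac{1}{n}\sum_i x_i^2 - \left(\frac{1}{n}\sum_i x_i\right)^2. It can lose precision when the values are large relative to their spread, in which case Welford's online algorithm is the numerically safer single-pass alternative.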
Re: Dynamically calculated range facet
On Tue, 2007-06-26 at 16:48 -0700, Mike Klaas wrote: On 26-Jun-07, at 3:01 PM, Martin Grotzke wrote: AFAICS I do not have the possibility to specify range queries in my application, as I do not have a clue what's the lowest and highest price in the search result and what are good ranges according to the (statistical) distribution of prices in the search result. So if it would be possible to go over each item in the search result I could check the price field and define my ranges for the specific query on solr side and return the price ranges as a facet. Has anybody done s.th. like this before, or is there s.th. that I'm missing and why this approach does not make sense at all? Otherwise, what would be a good starting point to plug in such functionality into solr? Easy: facet based on fixed ranges (say, every 10 dollars for x < 100, 100 dollars for x < 1000, etc.), and combine them sensically on the client-side. Requires no solr-side modification. But then I have to find x (the highest value of the price field?) on solr side and also I have to build the fixed ranges on solr side, right? Cheers, Martin A bit harder: define your own request handler that loops over the documents after a search and samples the values of (say) the first 20 docs (or more, but be sure to use the FieldCache if so). Calculate your range queries, facets (code will be almost identical to the code in the builtin request handlers), and return the results. cheers, -Mike -- Martin Grotzke http://www.javakaffee.de/blog/ signature.asc Description: This is a digitally signed message part
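For the fixed-range variant, plain facet.query parameters are usually enough, and the buckets can then be merged client-side; a sketch with SolrJ (field name, ranges and the SolrClient class name are assumptions/placeholders):

import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FixedPriceRanges {

    // Ask Solr to count documents in fixed price buckets; adjacent buckets
    // with small counts can then be merged on the client.
    static Map<String, Integer> priceRangeCounts(SolrClient solr, String userQuery) throws Exception {
        SolrQuery query = new SolrQuery(userQuery);
        query.setRows(0);
        query.setFacet(true);
        query.addFacetQuery("price:[* TO 10]");
        query.addFacetQuery("price:[10 TO 50]");
        query.addFacetQuery("price:[50 TO 100]");
        query.addFacetQuery("price:[100 TO *]");

        QueryResponse rsp = solr.query(query);
        return rsp.getFacetQuery();  // facet.query string -> count
    }
}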
Re: Dynamically calculated range facet
On Tue, 2007-06-26 at 19:53 -0700, John Wang wrote: www.browseengine.com has facet search that handles this. You are calculating range facets dynamically? Do you have any code I can have a look at? I had a look at c.b.solr.BoboRequestHandler, but this does not seem to calculate ranges. Cheers, Martin We are working on a solr plugin. -John On 6/26/07, Mike Klaas [EMAIL PROTECTED] wrote: On 26-Jun-07, at 3:01 PM, Martin Grotzke wrote: AFAICS I do not have the possibility to specify range queries in my application, ... Easy: facet based on fixed ranges (say, every 10 dollars for x < 100, 100 dollars for x < 1000, etc.), and combine them sensically on the client-side. Requires no solr-side modification. A bit harder: define your own request handler that loops over the documents after a search and samples the values of (say) the first 20 docs (or more, but be sure to use the FieldCache if so). Calculate your range queries, facets (code will be almost identical to the code in the builtin request handlers), and return the results. cheers, -Mike -- Martin Grotzke http://www.javakaffee.de/blog/ signature.asc Description: This is a digitally signed message part
RE: Dynamically calculated range facet
On Wed, 2007-06-27 at 09:06 -0400, Will Johnson wrote: one thing to keep in mind: it's typically not a good idea to have the constraint set of a facet change just because some other constraint was added to the query -- individual constraints might disappear because they no longer apply, but it can be very disconcerting to a user when options change on them. if i search on ipod a statistical analysis of prices might yield facet ranges of $1-20, $20-60, $60-120, $120-$200 ... if i then click on accessories the statistics might skew cheaper, so the new ranges are $1-20, $20-30, $30-40, $40-70 ... and now i'm a frustrated user, because i really wanted to use the range $20-60 (that just happens to be my budget) and you offered it to me and then you took it away ... i have to undo my selection of accessories, then click $20-60, and then click accessories to get what i want ... not very nice. Many of the other engines I've worked with in the past did this and it was one of the most requested/implemented features we had with regard to facets. That doesn't make it 'right' but it did tend to make product managers and test users happy. The use case that often came up was the ability to dynamically drill inside ranges. For instance my first search for 'computer' on a large ecommerce site might yield ranges of 0-500, 500-1000, 1000-2000, 2000+; selecting 500-1000 might then yield ranges of 500-600, 600-700 and so on. There are also many different algorithms that can be employed: equal frequency per facet count, equal-sized ranges, rounded ranges, etc. I just had a conversation with our customer and they also want to have it like this - adjusting with a new facet constraint... Cheers, Martin - will -- Martin Grotzke http://www.javakaffee.de/blog/ signature.asc Description: This is a digitally signed message part
Dynamically calculated range facet
Hi, my documents (products) have a price field, and I want to have a dynamically calculated range facet for that in the response. E.g. I want to have this in the response

price:[* TO 20] - 23
price:[20 TO 40] - 42
price:[40 TO *] - 33

if prices are between 0 and 60, but

price:[* TO 100] - 23
price:[100 TO 200] - 42
price:[200 TO *] - 33

if prices are between 0 and 300. AFAICS I do not have the possibility to specify range queries in my application, as I do not have a clue what's the lowest and highest price in the search result and what are good ranges according to the (statistical) distribution of prices in the search result. So if it would be possible to go over each item in the search result I could check the price field and define my ranges for the specific query on solr side and return the price ranges as a facet. Has anybody done s.th. like this before, or is there s.th. that I'm missing and why this approach does not make sense at all? Otherwise, what would be a good starting point to plug in such functionality into solr? Thanx a lot in advance, cheers, Martin signature.asc Description: This is a digitally signed message part
Re: All facet.fields for a given facet.query?
On Tue, 2007-06-19 at 11:09 -0700, Chris Hostetter wrote: I solve this problem by having metadata stored in my index which tells my custom request handler what fields to facet on for each category ... How do you define this metadata? Cheers, Martin but i've also got several thousand categories. If you've got less than 100 categories, you could easily enumerate them all with default facet.field params in your solrconfig using separate requesthandler instances. : What do the experts think about this? you may want to read up on the past discussion of this in SOLR-247 ... in particular note the link to the mail archive where there was additional discussion about it as well. Where we left things is that it might make sense to support true globbing in both fl and facet.field, so you can use naming conventions and say things like facet.field=facet_* but that in general trying to do something like facet.field=* would be a very bad idea even if it was supported. http://issues.apache.org/jira/browse/SOLR-247 -Hoss -- Martin Grotzke http://www.javakaffee.de/blog/ signature.asc Description: This is a digitally signed message part
Re: All facet.fields for a given facet.query?
On Tue, 2007-06-19 at 19:16 +0200, Thomas Traeger wrote: Hi, I'm also just at that point where I think I need a wildcard facet.field parameter (or someone points out another solution for my problem...). Here is my situation: I have many products of different types with totally different attributes. There are currently more than 300 attributes. I use dynamic fields to import the attributes into solr without having to define a specific field for each attribute. Now when I make a query I would like to get back all facet.fields that are relevant for that query. I think it would be really nice if I didn't have to know which facet fields are there at query time, and instead could just import attributes into dynamic fields, get the relevant facets back and decide in the frontend which to display and how... Do you really need all facets in the frontend? Would it be a solution to have a facet ranking in the field definitions, and then decide at query time on which fields to facet? This would need an additional query parameter like facet.query.count. E.g. if you have a query with q=foo+AND+prop1:bar+AND+prop2:baz and you have fields prop1 with facet-ranking 100, prop2 with facet-ranking 90, prop3 with facet-ranking 80, prop4 with facet-ranking 70, prop5 with facet-ranking 60, then you might decide not to facet on prop1 and prop2 as you already have a constraint on them, but to facet on prop3 and prop4 if facet.query.count is 2. Just thinking about that... :) Cheers, Martin What do the experts think about this? Tom -- Martin Grotzke http://www.javakaffee.de/blog/ signature.asc Description: This is a digitally signed message part
Re: All facet.fields for a given facet.query?
On Wed, 2007-06-20 at 12:59 +0200, Thomas Traeger wrote: Martin Grotzke schrieb: On Tue, 2007-06-19 at 19:16 +0200, Thomas Traeger wrote: [...] I think it would be really nice if I didn't have to know which facet fields are there at query time, and instead could just import attributes into dynamic fields, get the relevant facets back and decide in the frontend which to display and how... Do you really need all facets in the frontend? no, only the subset with matches for the current query. ok, that's somewhat similar to our requirement, but we want to get only e.g. the first 5 relevant facets back from solr and not handle this in the frontend. Would it be a solution to have a facet ranking in the field definitions, and then decide at query time on which fields to facet? This would need an additional query parameter like facet.query.count. [...] One step after the other ;o), the ranking of the facets will be another problem I have to solve; counts of facets and matching documents will be a starting point. Another idea is to use the score of the documents returned by the query to compute a score for the facet.field... Yep, this is also different for different applications. I'm also interested in this problem and would like to help solve it (though I'm really new to lucene and solr)... Cheers, Martin Tom -- Martin Grotzke http://www.javakaffee.de/blog/ signature.asc Description: This is a digitally signed message part
Re: All facet.fields for a given facet.query?
On Wed, 2007-06-20 at 12:49 -0700, Chris Hostetter wrote: : I solve this problem by having metadata stored in my index which tells : my custom request handler what fields to facet on for each category ... : How do you define this metadata? this might be a good place to start, note that this message is almost two years old, and predates the open-sourcing of Solr ... the Servlet referred to in this thread is Solr. http://www.nabble.com/Announcement%3A-Lucene-powering-CNET.com-Product-Category-Listings-p748420.html ...i think i also talked a bit about the metadata documents in my apachecon slides from last year ... but i don't really remember, and i haven't looked at them in a while... http://people.apache.org/~hossman/apachecon2006us/ thx, I'll have a look at these resources. cheers, martin -Hoss signature.asc Description: This is a digitally signed message part
Re: Solr 1.2 HTTP Client for Java
On Thu, 2007-06-14 at 11:32 +0100, Daniel Alheiros wrote: Hi I've been using one Java client I got from a colleague but I don't know exactly its version or where to get any update for it. Base package is org.apache.solr.client (where there are some common packages) and the client main package is org.apache.solr.client.solrj. Is it available via Maven2 central repository? Have a look at the issue tracker, there's one with solr clients: http://issues.apache.org/jira/browse/SOLR-20 I've also used one of them, but to be honest, do not remember which one ;) Cheers, Martin Regards, Daniel http://www.bbc.co.uk/ signature.asc Description: This is a digitally signed message part
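For what it's worth, basic usage of the bundled SolrJ client looks roughly like this (a sketch only - the client class was renamed over the years, e.g. CommonsHttpSolrServer, HttpSolrServer, HttpSolrClient, and the URL is a placeholder):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrjExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build();
        try {
            SolrQuery query = new SolrQuery("ipod");
            query.setRows(10);
            QueryResponse rsp = solr.query(query);
            for (SolrDocument doc : rsp.getResults()) {
                System.out.println(doc.getFieldValue("name"));
            }
        } finally {
            solr.close();
        }
    }
}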
Re: Interesting Practical Solr Question
On Tue, 2007-05-22 at 13:06 -0400, Erik Hatcher wrote: On May 22, 2007, at 11:31 AM, Martin Grotzke wrote: You need to specify the constrants (facet.query or facet.field params) Too bad, so we would have either to know the schema in the application or provide queries for index metadata / the schema / faceting info. However, the LukeRequestHandler (currently a work in progress) provides the fields and their types. You certainly would want to specify which fields you want returned as facets rather than it just assuming you want all fields (consider a full-text field!). For sure, perhaps the schema field element could be extended by an attribute isfacet. But then we reach the point where we want to have facet categories, and depending on the context (query) different facets (categories) are returned. But really cool what information the LukeRequestHandler provides!! Cheers, Martin http://wiki.apache.org/solr/LukeRequestHandler Erik -- Martin Grotzke http://www.javakaffee.de/blog/ signature.asc Description: This is a digitally signed message part
Re: Interesting Practical Solr Question
On Tue, 2007-05-22 at 15:10 -0400, Erik Hatcher wrote: On May 22, 2007, at 1:36 PM, Martin Grotzke wrote: For sure, perhaps the schema field element could be extended by an attribute isfacet There is no effective difference between a facet field and any other indexed field. What fields are facets is application specific and not really something Solr should be responsible for. In Solr Flare (an evolving Ruby on Rails plugin), we made the decision to use naming conventions to determine what fields are facets. *_facet named fields are facets. Maybe that convention would work in your scenario also? Yes, this is an option, good idea. Thanx cheers, Martin Erik signature.asc Description: This is a digitally signed message part
Re: PriceJunkie.com using solr!
Very nice and really fast, congrats! Are you willing to provide the mentioned features to solr users? I think especially the category to facet management (facet groups) is really useful... It would be very nice to have this problem solved once... :) Cheers, Martin On Wed, 2007-05-16 at 16:28 -0500, Mike Austin wrote: I just wanted to say thanks to everyone for the creation of solr. I've been using it for a while now and I have recently brought one of my side projects online. I have several other projects that will be using solr for its search and facets. Please check out www.pricejunkie.com and let us know what you think. You can give feedback and/or sign up on the mailing list for future updates. The site is very basic right now and many new and useful features plus merchants and product categories will be coming soon! I thought it would be a good idea to at least have a few people use it to get some feedback early and often. Some of the nice things behind the scenes that we did with solr:

- created custom request handlers that have category to facet to attribute caching built in
- category to facet management
- ability to manage facet groups (attributes within a set facet) and assign them to categories
- ability to create any category structure and share facet groups
- facet inheritance for any category (a facet group can be defined on a parent category and pushed down to all children)
- ability to create sub-categories as facets instead of normal sub-categories
- simple xml configuration for the final outputted category configuration file

I'm sure there are more cool things but that is all for now. Join the mailing list to see more improvements in the future. Also.. how do I get added to the Using Solr wiki page? Thanks, Mike Austin -- Martin Grotzke http://www.javakaffee.de/blog/ signature.asc Description: This is a digitally signed message part