Re: field collapsing performance in sharded environment
That's not the way grouping is done. In a first round, all shards return their 10 best groups (represented as their 10 best grouping values). As a result it's a three-round process instead of the two rounds of a regular search, so observing an increase in latency is normal, but not in the realm of what you are seeing here. Most probably it is due to the performance issue of TermAllGroupsCollector, which you can patch very easily. On Thu, Nov 14, 2013 at 3:56 PM, Erick Erickson erickerick...@gmail.com wrote: bq: Of the 10k docs, most have a unique near duplicate hash value, so there are about 10k unique values for the field that I'm grouping on. I suspect (but don't know the grouping code well) that this is the issue. You're getting the top N groups, right? But in the general case, you can't ensure that the top N from shard1 has any relation to the top N from shard2. So I _suspect_ that the code returns all of the groups. Say that group 5 has 3 docs on shard1 but 3,000 docs on shard2. To get the true top N, you need to collate all the values from all the groups; you can't just return the top 10 groups from each shard and get correct counts. Since your group cardinality is about 10K/shard, you're pushing 10 packets each containing 10K entries back to the originating shard, which has to combine/sort them all to get the true top N. At least that's my theory. Your situation is special in that you say that your groups don't appear on more than one shard, so you'd probably have to write something that aborted this behavior and returned only the top N, if I'm right. But that begs the question of why you're doing this. What purpose is served by grouping on documents that probably only have 1 member? Best, Erick On Wed, Nov 13, 2013 at 2:46 PM, David Anthony Troiano dtroi...@basistech.com wrote: Hello, I'm hitting a performance issue when using field collapsing in a distributed Solr setup and I'm wondering if others have seen it and if anyone has an idea to work around it.
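A toy Python sketch (not Solr code; the group names and counts are made up) of the merge problem Erick describes: if each shard only sends its local top-N groups, part of a group's count is dropped, so the coordinating node has to collate every shard's full per-group counts to get the true top N.

```python
# Toy sketch of why per-shard top-N group counts cannot simply be merged.
from collections import Counter

# Hypothetical per-shard doc counts per group key.
shard1 = Counter({"g5": 3, "g1": 100, "g2": 90})
shard2 = Counter({"g5": 3000, "g3": 80, "g4": 70})

def naive_top(shards, n):
    # Wrong: each shard contributes only its local top-n groups.
    merged = Counter()
    for s in shards:
        merged.update(dict(s.most_common(n)))
    return dict(merged.most_common(n))

def collated_top(shards, n):
    # Right: collate all group counts from every shard, then take the top-n.
    merged = Counter()
    for s in shards:
        merged.update(s)
    return dict(merged.most_common(n))

# The naive merge misses g5's 3 docs on shard1: 3000 instead of 3003.
```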
I'm using field collapsing to deduplicate documents that have the same near duplicate hash value, and deduplicating at query time (as opposed to filtering at index time) is a requirement. I have a sharded setup with 10 cores (not SolrCloud), each with ~1,000 documents. Of the 10k docs, most have a unique near duplicate hash value, so there are about 10k unique values for the field that I'm grouping on. The grouping parameters that I'm using are: group=true group.field=near dupe hash field group.main=true I'm issuing distributed queries (shards=s1,s2,...,s10) where the only difference is the absence or presence of these three grouping parameters, and I'm consistently seeing a marked difference in performance (as a representative data point, 200ms latency without grouping and 1600ms with grouping). Interestingly, if I put all 10k docs on the same core and query that core independently with and without grouping, I don't see much of a latency difference, so the performance degradation seems to exist only in the sharded setup. Is there a known performance issue with field collapsing in a sharded setup (perhaps one that only manifests when the grouping field has many unique values), or have other people observed this? Any ideas for a workaround? Note that docs in my sharded setup can only have the same signature if they're in the same shard, so perhaps that can be used to boost perf, though I don't see an exposed way to do so. A follow-on question is whether we're likely to see the same issue if/when we move to SolrCloud. Thanks, Dave -- __ Masurel Paul e-mail: paul.masu...@gmail.com
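For concreteness, here is a sketch of the two query strings being compared (nothing is sent here; the shard names and the `near_dupe_hash` field name are assumptions standing in for Dave's real ones). The only difference between the fast and slow query is the three group.* parameters.

```python
# Build the two query strings being compared; only group.* differs.
from urllib.parse import urlencode

base = {"q": "*:*", "shards": ",".join("s%d" % i for i in range(1, 11))}
grouped = dict(base)
grouped.update({"group": "true",
                "group.field": "near_dupe_hash",  # hypothetical field name
                "group.main": "true"})

plain_qs = urlencode(base)
grouped_qs = urlencode(grouped)
# Either string would be appended to e.g. http://host:8983/solr/select?
```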
Re: Does MMap work on VirtualBox?
Hi, You can MMAP a size bigger than your memory without any problem. Part of your file will just not be loaded into RAM, because you don't access it often. If you are short on memory, consider deactivating Host I/O Caching, as it would only be redundant with your guest OS page cache. Regards, Paul On Fri, Aug 16, 2013 at 10:26 PM, Shawn Heisey s...@elyograg.org wrote: On 8/16/2013 1:02 PM, vibhoreng04 wrote: I have a big index of 256 GB. Right now it is on one physical box with 256 GB RAM. I am planning to virtualize it as 8 boxes of 32 GB RAM each. Will MMap still work in this configuration? As far as MMap goes, if the operating system you are running is 64-bit, your Java is 64-bit, and the OS supports MMap (which almost every operating system does, including Linux and Windows), then you'd be fine. If you have the option of running Solr on bare metal vs. running on the same hardware in a virtualized environment, you should always choose the bare metal. I had a Solr installation with a sharded index. When I first set it up, I used virtual machines, one Solr instance and shard per VM. Half the VMs were running on one physical box, half on another. For redundancy, I had a second pair of physical servers doing the same thing, each with VMs representing half the index. That same setup now runs on bare metal -- the exact same physical machines, in fact. The index arrangement is nearly the same as before, except it uses multicore Solr, one instance per machine. Removing the virtualization layer helped performance quite a bit. Average QTimes went way down and it took less time to do a full index rebuild. Thanks, Shawn
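A minimal Python demonstration of Paul's point: mmap gives you a view of a file without reading it all up front, and pages are only faulted in when touched, which is why mapping a file bigger than RAM is fine. (Toy 1 MiB temp file here; the mechanism, not the scale, is what's being shown.)

```python
# Map a file and touch one byte; only the touched page needs to be resident.
import mmap, os, tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"\0" * (1 << 20))  # 1 MiB file
    with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mm:
        size = len(mm)   # the whole file is addressable...
        first = mm[0]    # ...but pages fault in lazily on access
finally:
    os.close(fd)
    os.remove(path)
```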
Re: Unexpected behavior when sorting groups
On Mon, Aug 5, 2013 at 2:42 AM, Tony Paloma to...@valvesoftware.com wrote: Thanks Paul. That's helpful. I'm not familiar with the concept of custom caches. Would this be custom Java code or something defined in the config/schema? Can you point me to some documentation? My solution requires both writing custom Java code and defining things in your solrconfig.xml. I'm waiting for approval to release my plugin, but I'm afraid I don't have any visibility on the length of the process. There is only the bare minimum in the documentation: http://wiki.apache.org/solr/SolrCaching Write a class like public class YourCache extends SolrCacheBase implements SolrCache<BytesRef,Double> and add some XML to your solrconfig.xml to instantiate your custom cache. At each commit, Solr will call warm...; you can inline the code to recompute all your min prices there, or delegate it to a CacheRegenerator. You then need to declare a ValueSource hitting this cache. You can access your cache in its parse function via the FunctionQParser: SolrIndexSearcher searcher = fp.getReq().getSearcher(); YourCache cache = (YourCache) searcher.getCache(cacheName); Another workaround I was thinking of using was making two Solr queries when wanting to sort groups by price desc. One to get the number of total groups and then another that gets groups sorted by price asc starting from ngroups - (start+rows) and then just flip the ordering to fake sorting by min(price) desc, but I was worried about the performance implications of that. That should work indeed... But keep in mind it will be extremely expensive if you start distributing your queries: if you want hits 100 to 110, each shard will be asked to send hits 0 to 110. SOLR-2072 has a similar request.
https://issues.apache.org/jira/browse/SOLR-2072 Bryan's comment is exactly what I'm looking for: I would like to be able to use sort and group.sort together such that the group.sort is applied within the group first, and the first document of each group is then used as the representative document to perform the overall sorting of groups. The latest comment there suggests that it's a bug in distributed mode, but I don't think that's the case since I'm only using one instance of Solr with no sharding or anything. This is not a bug. If I get some time, I'll try to write a post about how collapsing works in Solr. Even though it is counterintuitive, what you are asking for is actually a difficult problem. Regards, Paul -Original Message- From: Paul Masurel [mailto:paul.masu...@gmail.com] Sent: Sunday, August 04, 2013 2:54 PM To: solr-user@lucene.apache.org Subject: Re: Unexpected behavior when sorting groups Dear Tony, The behavior you described is correct, and what you are asking for is impossible with Solr as it is. I wouldn't, however, say it is a limitation of Solr: your problem is actually difficult and requires some preprocessing. One solution, if it is feasible for you, is to precompute the lowest price of each group beforehand and add it as a field to all of the documents of the group. The other way to address your problem is to do it within Solr. This can be done by adding a custom cache holding these values. You can implement the computation of the min price in the warm method. You can then add a custom function to return the result stored in this cache. Function values can be used for sorting. If it does not exist yet, you may open a ticket. I will try to get authorization to open-source a solution for this. Regards, Paul On Sat, Aug 3, 2013 at 12:00 AM, Tony Paloma to...@valvesoftware.com wrote: I'm using field collapsing to group documents by a single field and have run into something unexpected with how sorting of the groups works.
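Tony's two-query workaround boils down to paging arithmetic. A sketch (plain Python, with a list standing in for the asc-sorted groups, under the assumption that ngroups is exact): query ascending starting at ngroups - (start + rows), then reverse the page.

```python
# Fake "min(price) desc, page [start, start+rows)" from an asc-sorted list.
def desc_page(groups_asc, start, rows):
    ngroups = len(groups_asc)
    asc_start = max(ngroups - (start + rows), 0)
    asc_end = ngroups - start
    return list(reversed(groups_asc[asc_start:asc_end]))

groups_asc = [10, 20, 30, 40, 50]  # min(price) per group, ascending
# page 0, 2 rows of the desc ordering -> [50, 40]
```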
Right now I have each group return one document. The documents within each group are sorted by a field (price) in ascending order using group.sort, so that the document returned for each group in the search results is the cheapest document of the group. If I sort the groups amongst themselves using sort=price asc, I get what I expect: groups whose lowest-priced document is cheap show first and groups whose lowest-priced document is expensive show last. If I change this to sort=price desc, what happens is not what I would expect. I would like the groups to be returned in reverse order from the price asc case. Instead, the groups are sorted in descending order according to the highest-priced document in each group. I want groups to be sorted in descending order according to the lowest-priced document in each group, but it appears this is not possible. In other words, it appears sorting when groups are involved is limited to either MAX([field]) DESC or MIN([field]) ASC. The other two combinations are not possible. Does anyone know whether or not this is in fact impossible, and if not, how I might put in a feature request?
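A toy model (plain Python with made-up prices, not Solr internals) of the asymmetry Tony observes: sorting the group representatives descending effectively keys on each group's max price, not its min, so min-desc ordering is unreachable.

```python
# Three groups with the prices of their documents.
groups = {"A": [5, 100], "B": [10, 20], "C": [7, 8]}

min_asc = sorted(groups, key=lambda g: min(groups[g]))             # what Solr gives for asc
max_desc = sorted(groups, key=lambda g: max(groups[g]), reverse=True)  # what desc actually does
min_desc = sorted(groups, key=lambda g: min(groups[g]), reverse=True)  # what Tony wants

# max_desc puts A first because of its 100-priced doc, even though A has the
# cheapest doc overall; min_desc (B, C, A) is the reversed min_asc ordering.
```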
Re: Unexpected behavior when sorting groups
Here is some detail about how grouping is implemented in Solr: http://fulmicoton.com/posts/grouping-in-solr/ On Mon, Aug 5, 2013 at 2:42 AM, Tony Paloma to...@valvesoftware.com wrote: Thanks Paul. That's helpful. I'm not familiar with the concept of custom caches. Would this be custom Java code or something defined in the config/schema? Can you point me to some documentation? Another workaround I was thinking of using was making two Solr queries when wanting to sort groups by price desc. One to get the number of total groups and then another that gets groups sorted by price asc starting from ngroups - (start+rows) and then just flip the ordering to fake sorting by min(price) desc, but I was worried about the performance implications of that. SOLR-2072 has a similar request. https://issues.apache.org/jira/browse/SOLR-2072 Bryan's comment is exactly what I'm looking for: I would like to be able to use sort and group.sort together such that the group.sort is applied within the group first, and the first document of each group is then used as the representative document to perform the overall sorting of groups. The latest comment there suggests that it's a bug in distributed mode, but I don't think that's the case since I'm only using one instance of Solr with no sharding or anything. -Original Message- From: Paul Masurel [mailto:paul.masu...@gmail.com] Sent: Sunday, August 04, 2013 2:54 PM To: solr-user@lucene.apache.org Subject: Re: Unexpected behavior when sorting groups Dear Tony, The behavior you described is correct, and what you are asking for is impossible with Solr as it is. I wouldn't, however, say it is a limitation of Solr: your problem is actually difficult and requires some preprocessing. One solution, if it is feasible for you, is to precompute the lowest price of each group beforehand and add it as a field to all of the documents of the group. The other way to address your problem is to do it within Solr.
This can be done by adding a custom cache holding these values. You can implement the computation of the min price in the warm method. You can then add a custom function to return the result stored in this cache. Function values can be used for sorting. If it does not exist yet, you may open a ticket. I will try to get authorization to open-source a solution for this. Regards, Paul On Sat, Aug 3, 2013 at 12:00 AM, Tony Paloma to...@valvesoftware.com wrote: I'm using field collapsing to group documents by a single field and have run into something unexpected with how sorting of the groups works. Right now I have each group return one document. The documents within each group are sorted by a field (price) in ascending order using group.sort, so that the document returned for each group in the search results is the cheapest document of the group. If I sort the groups amongst themselves using sort=price asc, I get what I expect: groups whose lowest-priced document is cheap show first and groups whose lowest-priced document is expensive show last. If I change this to sort=price desc, what happens is not what I would expect. I would like the groups to be returned in reverse order from the price asc case. Instead, the groups are sorted in descending order according to the highest-priced document in each group. I want groups to be sorted in descending order according to the lowest-priced document in each group, but it appears this is not possible. In other words, it appears sorting when groups are involved is limited to either MAX([field]) DESC or MIN([field]) ASC. The other two combinations are not possible. Does anyone know whether or not this is in fact impossible, and if not, how I might put in a feature request?
Re: Solr grouping performance
Collapsing is not that slow actually. With a high number of groups, you may just have to leave group.ngroups set to false. If you need to get the overall number of groups, you may have to patch Lucene: https://issues.apache.org/jira/browse/LUCENE-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13709974#comment-13709974 Martijn's patch, for instance, may work OK for your range of values. On Mon, Aug 5, 2013 at 9:11 AM, Alok Bhandari alokomprakashbhand...@gmail.com wrote: Hello, I need some functionality for which I found that grouping is the most suited feature. I want to know about the performance issues associated with it. In some posts I found that performance is a bottleneck, but I want to know: if I have 3 million records with 0.5 million distinct values for group.value, can I expect results to return in 2-3 seconds? The grouping field is an int, and I want only one field per document. I can afford to use up to 4GB RAM.
Re: Unexpected behavior when sorting groups
Dear Tony, The behavior you described is correct, and what you are asking for is impossible with Solr as it is. I wouldn't, however, say it is a limitation of Solr: your problem is actually difficult and requires some preprocessing. One solution, if it is feasible for you, is to precompute the lowest price of each group beforehand and add it as a field to all of the documents of the group. The other way to address your problem is to do it within Solr. This can be done by adding a custom cache holding these values. You can implement the computation of the min price in the warm method. You can then add a custom function to return the result stored in this cache. Function values can be used for sorting. If it does not exist yet, you may open a ticket. I will try to get authorization to open-source a solution for this. Regards, Paul On Sat, Aug 3, 2013 at 12:00 AM, Tony Paloma to...@valvesoftware.com wrote: I'm using field collapsing to group documents by a single field and have run into something unexpected with how sorting of the groups works. Right now I have each group return one document. The documents within each group are sorted by a field (price) in ascending order using group.sort, so that the document returned for each group in the search results is the cheapest document of the group. If I sort the groups amongst themselves using sort=price asc, I get what I expect: groups whose lowest-priced document is cheap show first and groups whose lowest-priced document is expensive show last. If I change this to sort=price desc, what happens is not what I would expect. I would like the groups to be returned in reverse order from the price asc case. Instead, the groups are sorted in descending order according to the highest-priced document in each group. I want groups to be sorted in descending order according to the lowest-priced document in each group, but it appears this is not possible.
In other words, it appears sorting when groups are involved is limited to either MAX([field]) DESC or MIN([field]) ASC. The other two combinations are not possible. Does anyone know whether or not this is in fact impossible, and if not, how I might put in a feature request?
Re: TrieField and FieldCache confusion
Thank you very much for your very fast answer and all the pointers. That's what I thought, but then I got confused by the last note on http://wiki.apache.org/solr/StatsComponent : TrieFields have to use a precisionStep of -1 to avoid using UnInvertedField.java. Consider using one field for doing stats, and one for doing range faceting on. I assume it referred to a former version of Solr. On Wed, Jul 31, 2013 at 7:43 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : Can I expect the FieldCache of Lucene to return the correct values when : working : with TrieField with the precisionStep higher than 0. If not, what did I get : wrong? Yes -- the code for building FieldCaches from Trie fields is smart enough to ensure that only the real original values are used to populate the cache (see for example FieldCache.NUMERIC_UTILS_INT_PARSER and the classes linked to from its javadocs): https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/FieldCache.html#NUMERIC_UTILS_INT_PARSER https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/util/NumericUtils.html https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/document/IntField.html (Solr's Trie fields are backed by the various numeric fields in Lucene -- ie: solr:TrieIntField - lucene:IntField. The Trie* prefix is used in Solr because Lucene already had classes named IntField, DoubleField, etc. when the Trie-based impls were added to Lucene.) -Hoss
Re: Unexpected character '<' (code 60) expected '='
You can check your XML's validity very simply with xmllint: xmllint file Does this return an error? On Thu, Aug 1, 2013 at 9:59 AM, deniz denizdurmu...@gmail.com wrote: Vineet Mishra wrote: I am using Solr 3.5 with the posting XML file size of just 1Mb. On Wed, Jul 31, 2013 at 8:19 PM, Shawn Heisey <solr@...> wrote: On 7/31/2013 7:16 AM, Vineet Mishra wrote: I checked the file... nothing is there. I mean the formatting is correct, it's a valid XML file. What version of Solr, and how large is your XML file? If Solr is older than version 4.1, then the POST buffer limit is decided by your container config, which based on your stacktrace is Tomcat. If you have 4.1 or later, then the POST buffer limit is decided by Solr, and defaults to 2048KiB. Could that be the problem? Thanks, Shawn You might need to escape some chars, like < to &lt; and so on - Smart, but doesn't study... Would succeed if he studied...
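The escaping deniz mentions, sketched with the Python stdlib (the `<add><doc>` wrapper is the usual Solr XML update shape; the field name `t` is just illustrative): raw `<` and `&` in element text must become entities before the XML is posted, otherwise the parser hits an unexpected `<`.

```python
# Escape special characters in element text, then verify the doc parses.
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw = 'a < b & "c"'
escaped = escape(raw)  # escapes &, < and > by default
doc = "<add><doc><field name='t'>%s</field></doc></add>" % escaped
parsed = ET.fromstring(doc)  # parses cleanly; the unescaped text would not
```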
Re: Group and performing statistics on groups
https://issues.apache.org/jira/browse/SOLR-2931 Please add a word on the JIRA describing your need and keep an eye on the ticket. I may release such a plugin soon. Regards, Paul On Fri, Jul 26, 2013 at 4:16 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, I think no, and I think there is a JIRA issue open for that. Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Fri, Jul 26, 2013 at 2:32 PM, Vineet Mishra clearmido...@gmail.com wrote: Hi, This is an urgent call. I am grouping the Solr documents by a field name and want to get the range (min and max) value of another field in each group. StatsComponent works fine on all the documents as a whole, rendering the max and min of a field; is it possible to get the StatsComponent per group in Solr? Thanks and Regards, Vineet
TrieField and FieldCache confusion
Hello everyone, I have a question about Solr's TrieField and Lucene's FieldCache. From my understanding, Solr added the TrieField implementation to perform faster range queries. For each value it will index multiple terms, the n-th term being a masked version of our value with its lowest (precisionStep * n) bits masked out. When uninverting the field to populate a FieldCache, the last value with regard to the lexicographical order will be retained, which from my understanding should be the term with the highest precision. Can I expect the FieldCache of Lucene to return the correct values when working with a TrieField with a precisionStep higher than 0? If not, what did I get wrong? Regards, Paul Masurel e-mail: paul.masu...@gmail.com
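A simplified Python sketch of the multi-term trie encoding described above. This only shows the masking; the real Lucene encoding (see NumericUtils) also prefix-codes the shift amount into each term's bytes, which is what makes the terms' ordering well-defined.

```python
# All masked variants that would be indexed for one value (simplified).
def trie_terms(value, precision_step, bits=32):
    """Return (shift, masked_value) pairs: level n zeroes the lowest
    precision_step * n bits; shift 0 is the full-precision term."""
    terms = []
    shift = 0
    while shift < bits:
        terms.append((shift, (value >> shift) << shift))
        shift += precision_step
    return terms

# value 0x12345678 with precisionStep 8 -> masks at shifts 0, 8, 16, 24
```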
Re: FieldCollapsing issues in SolrCloud 4.4
If your issue is that you want to retrieve the number of groups: group.ngroups will return the sum of the number of groups per shard. This is not the overall number of groups if some groups are present on more than one shard. To make sure that this does not happen, one can choose to distribute documents so that all the documents with the same group key go to the same shard. (Disclaimer: before doing so, you need to make sure that your documents will still be spread about equally.) You can check out how to do that here: https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud On Wed, Jul 31, 2013 at 8:02 PM, Ali, Saqib docbook@gmail.com wrote: Hello Paul, Can you please explain what you mean by: To get the exact number of groups, you need to shard along your grouping field Thanks! :) On Wed, Jul 31, 2013 at 3:08 AM, Paul Masurel paul.masu...@gmail.com wrote: Do you mean you get different results with group=true? numFound is supposed to return the number of ungrouped hits. To get the number of groups, you are expected to set group.ngroups=true. Even then, the result will only give you an upper bound in a distributed environment. To get the exact number of groups, you need to shard along your grouping field. If you have many groups, you may also experience a huge performance hit, as the current implementation has been heavily optimized for a low number of groups (e.g. e-commerce categories). Paul On Wed, Jul 31, 2013 at 1:59 AM, Ali, Saqib docbook@gmail.com wrote: Hello all, Is anyone experiencing issues with numFound when using group=true in SolrCloud 4.4? Sometimes the results are off for us. I will post more details shortly. Thanks.
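A toy Python illustration (made-up doc ids and a stand-in hash, not SolrCloud's actual router) of why summing per-shard ngroups overcounts, and why routing by group key fixes it. SolrCloud's composite-id router achieves the same thing when documents are indexed with ids of the form groupKey!docId.

```python
# Five docs in three groups, spread over two shards two different ways.
docs = [("d1", "gA"), ("d2", "gA"), ("d3", "gB"), ("d4", "gB"), ("d5", "gC")]

def shard_of(key, nshards=2):
    return sum(key.encode()) % nshards  # stand-in for the real hash router

# Routing by doc id: a group can straddle shards, so ngroups overcounts.
by_doc = {}
for doc_id, group in docs:
    by_doc.setdefault(shard_of(doc_id), set()).add(group)
per_shard_sum = sum(len(g) for g in by_doc.values())

# Routing by group key: each group lives on one shard, so the sum is exact.
by_group = {}
for doc_id, group in docs:
    by_group.setdefault(shard_of(group), set()).add(group)
routed_sum = sum(len(g) for g in by_group.values())

true_ngroups = len({g for _, g in docs})
```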