Re: field collapsing performance in sharded environment

2013-11-15 Thread Paul Masurel
That's not the way grouping is done.
In the first round, each shard returns its 10 best groups (represented by
their 10 best grouping values).

As a result, it's a three-round process instead of the two rounds of a regular
search, so observing an increase in latency is normal, but not to the extent
you are seeing here.

Most probably it is due to the performance issue of TermAllGroupsCollector
which you can patch very easily.
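
To illustrate the first round, here is a minimal sketch in plain Java. It is
not Solr's actual code: the names TopGroupsMergeSketch, GroupCandidate,
topGroups and mergeTopGroups are made up for this example, and it assumes
groups are ranked by the score of their best document. Under that assumption,
a group in the global top N is necessarily in the top N of the shard holding
its best document, so each shard only needs to send N candidates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TopGroupsMergeSketch {

    /** One group as seen by one shard: its value and the score of its best document there. */
    static final class GroupCandidate {
        final String groupValue;
        final double bestScore;

        GroupCandidate(String groupValue, double bestScore) {
            this.groupValue = groupValue;
            this.bestScore = bestScore;
        }
    }

    /** Shard side of round one: keep only the n best groups, best first. */
    static List<GroupCandidate> topGroups(List<GroupCandidate> shardGroups, int n) {
        List<GroupCandidate> sorted = new ArrayList<>(shardGroups);
        sorted.sort(Comparator.comparingDouble((GroupCandidate c) -> c.bestScore).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    /** Coordinator side: merge the per-shard lists and keep the n best group values overall. */
    static List<String> mergeTopGroups(List<List<GroupCandidate>> perShard, int n) {
        // A group seen on several shards is ranked by its best score across them.
        Map<String, Double> best = new HashMap<>();
        for (List<GroupCandidate> shard : perShard) {
            for (GroupCandidate c : shard) {
                best.merge(c.groupValue, c.bestScore, Math::max);
            }
        }
        List<Map.Entry<String, Double>> entries = new ArrayList<>(best.entrySet());
        entries.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Double> e : entries.subList(0, Math.min(n, entries.size()))) {
            result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<GroupCandidate> shard1 = Arrays.asList(
                new GroupCandidate("a", 9.0), new GroupCandidate("b", 5.0));
        List<GroupCandidate> shard2 = Arrays.asList(
                new GroupCandidate("b", 8.0), new GroupCandidate("d", 7.0));
        System.out.println(mergeTopGroups(
                Arrays.asList(topGroups(shard1, 2), topGroups(shard2, 2)), 2));
    }
}
```

The later rounds (fetching the best documents of the selected groups, then
their stored fields) are not modeled here.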


On Thu, Nov 14, 2013 at 3:56 PM, Erick Erickson erickerick...@gmail.com wrote:

 bq:   Of the 10k docs,
 most have a unique near duplicate hash value, so there are about 10k unique
 values for the field that I'm grouping on.

 I suspect (but don't know the grouping code well) that this is the issue.
 You're getting the top N groups, right? But in the general case, you can't
 ensure that the topN from shard1 has any relation to the topN from shard2.
 So I _suspect_ that the code returns all of the groups. Say that group 5
 has 3 docs on shard1 but 3,000 docs on shard2. To get the true top N, you
 need to collate all the values from all the groups; you can't just return
 the top 10 groups from each shard and get correct counts.

 Since your group cardinality is about 10K/shard, you're pushing 10 packets
 each containing 10K entries back to the originating shard, which has to
 combine/sort them all to get the true top N. At least that's my theory.

 Your situation is special in that you say that your groups don't appear on
 more than one shard, so you'd probably have to write something that aborted
 this behavior and returned only the top N, if I'm right.

 But that begs the question of why you're doing this. What purpose is served
 by grouping on documents that probably only have 1 member?

 Best,
 Erick


 On Wed, Nov 13, 2013 at 2:46 PM, David Anthony Troiano 
 dtroi...@basistech.com wrote:

  Hello,
 
  I'm hitting a performance issue when using field collapsing in a
  distributed Solr setup and I'm wondering if others have seen it and if
  anyone has an idea to work around it.
 
  I'm using field collapsing to deduplicate documents that have the same
  near duplicate hash value, and deduplicating at query time (as opposed to
  filtering at index time) is a requirement.  I have a sharded setup with 10
  cores (not SolrCloud), each having ~1000 documents.  Of the 10k docs,
  most have a unique near duplicate hash value, so there are about 10k unique
  values for the field that I'm grouping on.  The grouping parameters that
  I'm using are:
 
  group=true
  group.field=<near dupe hash field>
  group.main=true
 
  I'm attempting distributed queries (shards=s1,s2,...,s10) where the only
  difference is the absence or presence of these three grouping parameters
  and I'm consistently seeing a marked difference in performance (as a
  representative data point, 200ms latency without grouping and 1600ms with
  grouping).  Interestingly, if I put all 10k docs on the same core and query
  that core independently with and without grouping, I don't see much of a
  latency difference, so the performance degradation seems to exist only in
  the sharded setup.
 
  Is there a known performance issue when field collapsing in a sharded
  setup (perhaps one that only manifests when the grouping field has many
  unique values), or have other people observed this?  Any ideas for a
  workaround?  Note that docs in my sharded setup can only have the same
  signature if they're in the same shard, so perhaps that can be used to
  boost perf, though I don't see an exposed way to do so.
 
  A follow-on question is whether we're likely to see the same issue if /
  when we move to SolrCloud.
 
  Thanks,
  Dave
 




-- 
__

 Masurel Paul
 e-mail: paul.masu...@gmail.com


Re: Does MMap works on the Virtual Box?

2013-08-16 Thread Paul Masurel
Hi,

You can mmap a file bigger than your memory without any problem.
The parts of the file that you don't access often will simply not be kept
in RAM.

If you are short on memory, consider disabling VirtualBox's host I/O caching,
as it is redundant with your guest
OS page cache.
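
As an illustration of memory mapping on the JVM, here is a minimal sketch
using the standard java.nio API. The class name MmapSketch and its methods
are made up for this example, and it is not how Solr/Lucene map their index
files: a single FileChannel.map call is limited to Integer.MAX_VALUE bytes,
so a real index has to be mapped in several chunks. The point is only that
mapping does not read the file; pages are loaded when they are touched.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MmapSketch {

    /** Maps the file read-only and returns the byte at the given offset. */
    static byte readByteViaMmap(Path file, long offset) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // One mapping cannot exceed Integer.MAX_VALUE bytes.
            long size = Math.min(channel.size(), Integer.MAX_VALUE);
            if (offset < 0 || offset >= size) {
                throw new IllegalArgumentException("offset outside the mapped region: " + offset);
            }
            // Mapping reserves address space; the OS loads pages lazily on access.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            return buffer.get((int) offset);
        }
    }

    /** Writes a tiny temporary file and reads its third byte back through a mapping. */
    static byte demo() {
        try {
            Path file = Files.createTempFile("mmap-sketch", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, new byte[] {1, 2, 3, 4});
            return readByteViaMmap(file, 2);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```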

Regards,

Paul



On Fri, Aug 16, 2013 at 10:26 PM, Shawn Heisey s...@elyograg.org wrote:

 On 8/16/2013 1:02 PM, vibhoreng04 wrote:

 I have a big index of 256 GB. Right now it is on one physical box with
 256 GB of RAM. I am planning to virtualize it as 8 boxes with 32 GB of RAM
 each. Will MMap still work under these conditions?


 As far as MMap goes, if the operating system you are running is 64-bit,
 your Java is 64-bit, and the OS supports MMap (which almost every operating
 system does, including Linux and Windows), then you'd be fine.

 If you have the option of running Solr on bare metal vs. running on the
 same hardware in a virtualized environment, you should always choose the
 bare metal.

 I had a Solr installation with a sharded index.  When I first set it up, I
 used virtual machines, one Solr instance and shard per VM.  Half the VMs
 were running on one physical box, half on another.  For redundancy, I had a
 second pair of physical servers doing the same thing, each with VMs
 representing half the index.

 That same setup now runs on bare metal -- the exact same physical
 machines, in fact.  The index arrangement is nearly the same as before,
 except it uses multicore Solr, one instance per machine.

 Removing the virtualization layer helped performance quite a bit. Average
 QTimes went way down and it took less time to do a full index rebuild.

 Thanks,
 Shawn






Re: Unexpected behavior when sorting groups

2013-08-06 Thread Paul Masurel
On Mon, Aug 5, 2013 at 2:42 AM, Tony Paloma to...@valvesoftware.com wrote:

 Thanks Paul. That's helpful. I'm not familiar with the concept of custom
 caches. Would this be custom Java code or something defined in the
 config/schema? Can you point me to some documentation?


My solution requires both writing custom Java code and declaring the cache in
your solrconfig.xml.
I'm waiting for approval to release my plugin, but I'm afraid I don't have
any visibility on the length of the process.

There is only the bare minimum in the documentation.
http://wiki.apache.org/solr/SolrCaching

Write a class extending SolrCacheBase:

public class YourCache extends SolrCacheBase implements
SolrCache<BytesRef, Double>

You then add some XML to your solrconfig.xml to instantiate your custom cache.
At each commit, Solr will call warm(). You can inline the code that recomputes
all your min prices there, or delegate it to a CacheRegenerator.

You then need to declare a ValueSource that reads from this cache.
You can access your cache in its parse function via the FunctionQParser:

SolrIndexSearcher searcher = fp.getReq().getSearcher();
YourCache cache = (YourCache) searcher.getCache(cacheName);
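
To show the idea without depending on Solr, here is a minimal stand-in in
plain Java. MinPriceCacheSketch, Doc, warm and minPrice are hypothetical names
for this example only; a real implementation would implement Solr's cache
interface and read prices from the index instead of from a list.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MinPriceCacheSketch {

    /** Hypothetical document: the group it belongs to and its price. */
    static final class Doc {
        final String groupKey;
        final double price;

        Doc(String groupKey, double price) {
            this.groupKey = groupKey;
            this.price = price;
        }
    }

    // Replaced as a whole on each warm, so readers never see a half-built map.
    private volatile Map<String, Double> minPriceByGroup = Collections.emptyMap();

    /** Stand-in for the warm step run after a commit: recompute every group's min price. */
    void warm(List<Doc> allDocs) {
        Map<String, Double> fresh = new HashMap<>();
        for (Doc doc : allDocs) {
            fresh.merge(doc.groupKey, doc.price, Math::min);
        }
        minPriceByGroup = Collections.unmodifiableMap(fresh);
    }

    /** Stand-in for the function value used for sorting; NaN when the group is unknown. */
    double minPrice(String groupKey) {
        Double value = minPriceByGroup.get(groupKey);
        return value == null ? Double.NaN : value;
    }

    public static void main(String[] args) {
        MinPriceCacheSketch cache = new MinPriceCacheSketch();
        cache.warm(Arrays.asList(new Doc("g1", 5.0), new Doc("g1", 3.0), new Doc("g2", 7.0)));
        System.out.println(cache.minPrice("g1"));
    }
}
```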





 Another workaround I was thinking of using was making two Solr queries when
 wanting to sort groups by price desc. One to get the number of total groups
 and then another that gets groups sorted by price asc starting from ngroups
 - (start+rows) and then just flip the ordering to fake sorting by
 min(price) desc, but I was worried about the performance implications of
 that.


That should work indeed... But keep in mind it will be extremely expensive
if you start distributing your queries :
if you want to get hits from 100 to 110, shards will be asked to send hits
from 0 to 110.
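
The arithmetic of this workaround can be sketched as follows, in plain Java
with made-up names (FlippedPagingSketch and its methods are assumptions for
illustration, not Solr API). It computes which slice of the ascending list
corresponds to a page of the descending list, and how many entries each shard
would have to return for that slice in a distributed query.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class FlippedPagingSketch {

    /** Offset into the ascending ordering that holds the requested descending page. */
    static int ascendingStart(int ngroups, int start, int rows) {
        return Math.max(0, ngroups - (start + rows));
    }

    /** Number of rows to fetch; the last descending page may be shorter than rows. */
    static int ascendingRows(int ngroups, int start, int rows) {
        return Math.max(0, Math.min(rows, ngroups - start));
    }

    /** Builds the descending page from a list that is sorted ascending. */
    static <T> List<T> descendingPage(List<T> ascending, int start, int rows) {
        int n = ascending.size();
        int from = ascendingStart(n, start, rows);
        int count = ascendingRows(n, start, rows);
        List<T> page = new ArrayList<>(ascending.subList(from, from + count));
        Collections.reverse(page);
        return page;
    }

    /** In a distributed query each shard has to return hits 0 .. start + rows. */
    static int entriesRequestedPerShard(int start, int rows) {
        return start + rows;
    }

    public static void main(String[] args) {
        int ngroups = 1000;
        int ascStart = ascendingStart(ngroups, 100, 10);
        int ascRows = ascendingRows(ngroups, 100, 10);
        System.out.println("ascending start: " + ascStart);
        System.out.println("entries per shard: " + entriesRequestedPerShard(ascStart, ascRows));
    }
}
```

Note that the flipped offset grows with the number of groups, so the first
descending pages are the most expensive ones to fetch this way.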



 SOLR-2072 has a similar request.
 https://issues.apache.org/jira/browse/SOLR-2072

 Bryan's comment is exactly what I'm looking for:
  I would like to be able to use sort and group.sort together such that the
 group.sort is applied within the group first and the first document of
 each group is then used as the representative document to perform the
 overall sorting of groups.

 The latest comment there suggests that it's a bug in distributed mode, but
 I don't think that's the case since I'm only using one instance of Solr
 with no sharding or anything.


This is not a bug. If I get some time, I'll try to write a post about how
collapsing works in Solr.
Even though it is counterintuitive, what you are asking for is actually a
difficult problem.

Regards,

Paul



 -Original Message-
 From: Paul Masurel [mailto:paul.masu...@gmail.com]
 Sent: Sunday, August 04, 2013 2:54 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Unexpected behavior when sorting groups

 Dear Tony,

 The behavior you described is correct, and what you are asking for is
 impossible with Solr as it is.

 I wouldn't, however, say it is a limitation of Solr: your problem is
 actually difficult and requires some preprocessing.

 One solution, if it is feasible for you, is to precompute the lowest price
 of each group beforehand and add it as a field to all of the documents of
 the group.

 The other way to address your problem is to do that within Solr.
 This can be done by adding a custom cache holding these values.
 You can implement the computation of the min price in the warm method.

 You can then add a custom function to return the result stored in this
 cache. Function values can be used for sorting.

 If it does not exist yet, you may open a ticket. I will try and get
 authorization to open-source a solution for this.

 Regards,

 Paul




 On Sat, Aug 3, 2013 at 12:00 AM, Tony Paloma to...@valvesoftware.com
 wrote:

  I'm using field collapsing to group documents by a single field and
  have run into something unexpected with how sorting of the groups
  works. Right now I have each group return one document. The documents
  within each group are sorted by a field (price) in ascending order
  using group.sort so that the document returned for each group in the
  search results is the cheapest document of the group. If I sort the
  groups amongst themselves using sort=price asc, I get what I expect:
  groups whose lowest-priced document is cheap show first and groups whose
  lowest-priced document is expensive show last.
 
  If I change this to sort on price desc, what happens is not what I
  would expect. I would like the groups to be returned in reverse order
  from what happened when sorting by price asc. Instead, what happens is
  the groups are sorted in descending order according to the highest
  priced document in each group. I want groups to be sorted in
  descending order according to the lowest priced document in each group,
  but it appears this is not possible.
  In other words, it appears sorting when groups are involved is limited
  to either MAX([field]) DESC or MIN([field]) ASC. The other two
  combinations are not possible. Does anyone know whether

Re: Unexpected behavior when sorting groups

2013-08-06 Thread Paul Masurel
Here is some detail about how grouping is implemented in Solr.
http://fulmicoton.com/posts/grouping-in-solr/





Re: Solr grouping performace

2013-08-05 Thread Paul Masurel
Collapsing is actually not that slow. With a high number of groups,
you may just have to leave group.ngroups set to false.

If you need to get the overall number of groups, you may have
to patch Lucene.


https://issues.apache.org/jira/browse/LUCENE-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709974#comment-13709974
Martijn's patch, for instance, may work fine for your range of values.

On Mon, Aug 5, 2013 at 9:11 AM, Alok Bhandari 
alokomprakashbhand...@gmail.com wrote:

 Hello,
 I need some functionality for which I found that grouping is the most
 suitable feature. I want to know about the performance issues associated
 with it. In some posts I found that performance is a bottleneck, but I want
 to know: if I have 3 million records with 0.5 million distinct values for
 group.value, can I expect results to return in 2-3 seconds? The grouping
 field is an int, and I want only one field for a document. I can afford
 to use up to 4GB of RAM.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-grouping-performace-tp4082480.html
 Sent from the Solr - User mailing list archive at Nabble.com.






Re: Unexpected behavior when sorting groups

2013-08-04 Thread Paul Masurel
Dear Tony,

The behavior you described is correct, and what you are asking for
is impossible with Solr as it is.

I wouldn't, however, say it is a limitation of Solr: your problem is actually
difficult and requires some preprocessing.

One solution, if it is feasible for you, is to precompute the lowest price
of each group beforehand and add it as a field to all of the documents of the
group.

The other way to address your problem is to do that within Solr.
This can be done by adding a custom cache holding these values.
You can implement the computation of the min price in the warm method.

You can then add a custom function to return the result stored in this
cache. Function values can be used for sorting.
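
To make the difference between the two orderings concrete, here is a small
sketch in plain Java. GroupSortSketch and groupsDescending are made-up names,
and the in-memory map stands in for the index. Sorting groups by price
descending ranks each group by its highest price, whereas ranking groups by
their lowest price in descending order needs the per-group minimum to be
known up front, which is what the precomputed field or the cache provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GroupSortSketch {

    /**
     * Orders group keys in descending order of either their minimum or their
     * maximum price. Every group is assumed to hold at least one price.
     */
    static List<String> groupsDescending(Map<String, List<Double>> pricesByGroup, boolean byMinPrice) {
        Map<String, Double> sortValue = new HashMap<>();
        for (Map.Entry<String, List<Double>> e : pricesByGroup.entrySet()) {
            List<Double> prices = e.getValue();
            sortValue.put(e.getKey(), byMinPrice ? Collections.min(prices) : Collections.max(prices));
        }
        List<String> groups = new ArrayList<>(sortValue.keySet());
        groups.sort((a, b) -> {
            int byValue = Double.compare(sortValue.get(b), sortValue.get(a));
            return byValue != 0 ? byValue : a.compareTo(b);
        });
        return groups;
    }

    public static void main(String[] args) {
        Map<String, List<Double>> prices = new HashMap<>();
        prices.put("A", Arrays.asList(1.0, 100.0));
        prices.put("B", Arrays.asList(50.0, 60.0));
        System.out.println("by max price desc: " + groupsDescending(prices, false));
        System.out.println("by min price desc: " + groupsDescending(prices, true));
    }
}
```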

If it does not exist yet, you may open a ticket. I will try and get
authorization to open-source a solution for this.

Regards,

Paul




On Sat, Aug 3, 2013 at 12:00 AM, Tony Paloma to...@valvesoftware.com wrote:

 I'm using field collapsing to group documents by a single field and have
 run into something unexpected with how sorting of the groups works. Right
 now I have each group return one document. The documents within each group
 are sorted by a field (price) in ascending order using group.sort so that
 the document returned for each group in the search results is the cheapest
 document of the group. If I sort the groups amongst themselves using
 sort=price asc, I get what I expect: groups whose lowest-priced document is
 cheap show first and groups whose lowest-priced document is expensive show
 last.

 If I change this to sort on price desc, what happens is not what I would
 expect. I would like the groups to be returned in reverse order from what
 happened when sorting by price asc. Instead, what happens is the groups are
 sorted in descending order according to the highest priced document in each
 group. I want groups to be sorted in descending order according to the
 lowest priced document in each group, but it appears this is not possible.
 In other words, it appears sorting when groups are involved is limited to
 either MAX([field]) DESC or MIN([field]) ASC. The other two combinations
 are not possible. Does anyone know whether or not this is in fact
 impossible, and if not, how I might put in a feature request?






Re: TrieField and FieldCache confusion

2013-08-01 Thread Paul Masurel
Thank you very much for your very fast answer and
all the pointers.

That's what I thought, but then I got confused by the last note on
http://wiki.apache.org/solr/StatsComponent:

"TrieFields (http://wiki.apache.org/solr/TrieFields) has to use a
precisionStep of -1 to avoid using UnInvertedField.java
(http://wiki.apache.org/solr/UnInvertedField.java).
Consider using one field for doing stats, and one for doing range facetting
on."

I assume it referred to a former version of Solr.




On Wed, Jul 31, 2013 at 7:43 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 : Can I expect the FieldCache of Lucene to return the correct values when
 : working with TrieField with the precisionStep higher than 0. If not, what
 : did I get wrong?

 Yes -- the code for building FieldCaches from Trie fields is smart enough
 to ensure that only the real original values are used to populate the
 Cache.

 (See for example: FieldCache.NUMERIC_UTILS_INT_PARSER and the classes
 linked to from its javadocs...)


 https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/FieldCache.html#NUMERIC_UTILS_INT_PARSER

 https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/util/NumericUtils.html

 https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/document/IntField.html

 (Solr's Trie fields are backed by the various numeric fields in Lucene --
 ie: solr:TrieIntField -> lucene:IntField.  The Trie* prefix is used in
 Solr because it already had classes named IntField, DoubleField, etc...
 when the Trie-based impls were added to Lucene.)


 -Hoss






Re: Unexpected character '<' (code 60) expected '='

2013-08-01 Thread Paul Masurel
You can very simply check that your XML is well-formed with xmllint.

xmllint file

Does this return an error?
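
If xmllint is not at hand, the same kind of check can be sketched with the
XML parser that ships with the JDK. XmlCheckSketch, isWellFormed and escape
are made-up names for this example; it only checks well-formedness, not
whether Solr accepts the document, and the escape helper covers the five
predefined XML entities.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class XmlCheckSketch {

    /** Returns true when the given text parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Escapes the characters that have a special meaning in XML text and attributes. */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    public static void main(String[] args) {
        String value = "a < b & c";
        System.out.println(isWellFormed("<field name=\"t\">" + value + "</field>"));
        System.out.println(isWellFormed("<field name=\"t\">" + escape(value) + "</field>"));
    }
}
```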



On Thu, Aug 1, 2013 at 9:59 AM, deniz denizdurmu...@gmail.com wrote:

 Vineet Mishra wrote
  I am using Solr 3.5 with the posting XML file size of just 1Mb.
 
 
  On Wed, Jul 31, 2013 at 8:19 PM, Shawn Heisey solr@ wrote:
 
  On 7/31/2013 7:16 AM, Vineet Mishra wrote:
   I checked the file... nothing is there. I mean the formatting is
   correct; it's a valid XML file.
 
  What version of Solr, and how large is your XML file?
 
  If Solr is older than version 4.1, then the POST buffer limit is decided
  by your container config, which based on your stacktrace, is tomcat.  If
  you have 4.1 or later, then the POST buffer limit is decided by Solr,
  and defaults to 2048KiB.
 
  Could that be the problem?
 
  Thanks,
  Shawn
 
 


 you might need to escape some chars, like < to &lt; and so on



 -
 Smart, but doesn't study... If he studied, he would succeed...
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Unexpected-character-code-60-expected-tp4081603p4081854.html
 Sent from the Solr - User mailing list archive at Nabble.com.






Re: Group and performing statistics on groups

2013-08-01 Thread Paul Masurel
https://issues.apache.org/jira/browse/SOLR-2931

Please add a word on the JIRA describing your needs and
keep an eye on the ticket. I might release such a plugin
some time soon.

Regards,

Paul



On Fri, Jul 26, 2013 at 4:16 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Hi,

 I think no, and I think there is a JIRA issue open for that.

 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm



 On Fri, Jul 26, 2013 at 2:32 PM, Vineet Mishra clearmido...@gmail.com
 wrote:
  Hi
 
  This is an urgent call. I am grouping the Solr documents by a field name
  and want to get the range (min and max) value for another field in that
  group.
 
  StatsComponent works fine on all the documents as a whole, rendering the
  max and min of a field. Is it possible to get the StatsComponent per
  group?
 
 
  Thanks and Regards
  Vineet






TrieField and FieldCache confusion

2013-07-31 Thread Paul Masurel
Hello everyone,

I have a question about Solr TrieField and Lucene FieldCache.

From my understanding, Solr added the implementation of TrieField to
perform faster range queries.
For each value, it will index multiple terms, the n-th term being a masked
version of our value, showing only its first (precisionStep * n) bits.
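
As a simplified illustration of "multiple terms per value", here is a sketch
in plain Java. It is not Lucene's actual term encoding: TriePrefixSketch and
prefixTerms are made-up names, and the sketch assumes the convention where
each extra term drops another precisionStep low-order bits of a 32-bit value,
so the term at shift 0 is the only one carrying the full value.

```java
import java.util.ArrayList;
import java.util.List;

final class TriePrefixSketch {

    /** Lists the prefixes of a 32-bit value at shifts 0, step, 2 * step, ... below 32. */
    static List<String> prefixTerms(int value, int precisionStep) {
        if (precisionStep < 1) {
            throw new IllegalArgumentException("precisionStep must be >= 1");
        }
        List<String> terms = new ArrayList<>();
        for (int shift = 0; shift < 32; shift += precisionStep) {
            // Dropping the low-order bits keeps only a prefix of the value.
            int prefix = value >>> shift;
            terms.add("shift=" + shift + " prefix=" + Integer.toBinaryString(prefix));
        }
        return terms;
    }

    public static void main(String[] args) {
        for (String term : prefixTerms(0b10110110, 4)) {
            System.out.println(term);
        }
    }
}
```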

When uninverting the field to populate a FieldCache, the last value with
regard to the lexicographical order will be retained, which from my
understanding should be the term with the highest precision.

Can I expect the FieldCache of Lucene to return the correct values when
working with TrieField with a precisionStep higher than 0? If not, what did
I get wrong?

Regards,

Paul Masurel
e-mail: paul.masu...@gmail.com


Re: FieldCollapsing issues in SolrCloud 4.4

2013-07-31 Thread Paul Masurel
If your issue is that you want to retrieve the number of groups,
group.ngroups will return the sum of the number of groups per shard.

This is not the overall number of groups if some groups are present
on more than one shard.

To make sure that this does not happen, one can choose to distribute
documents so that all the documents with the same group key go to the same
shard.

(Disclaimer : Before doing so, you need to make sure that your documents
will still be spread
about equally.)

You can check out how to do that here
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
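
The over-count can be illustrated with a small sketch in plain Java.
NGroupsSketch and its methods are made-up names, and routing by
hashCode modulo the number of shards is only a stand-in for SolrCloud's own
document routing described at the link above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class NGroupsSketch {

    /** Sum of the per-shard group counts: an upper bound on the real count. */
    static int summedNGroups(List<? extends Set<String>> groupsPerShard) {
        int sum = 0;
        for (Set<String> shard : groupsPerShard) {
            sum += shard.size();
        }
        return sum;
    }

    /** Exact number of distinct groups across all shards. */
    static int exactNGroups(List<? extends Set<String>> groupsPerShard) {
        Set<String> all = new HashSet<>();
        for (Set<String> shard : groupsPerShard) {
            all.addAll(shard);
        }
        return all.size();
    }

    /** Sends every document to a shard chosen from its group key only. */
    static List<Set<String>> routeByGroupKey(List<String> docGroupKeys, int numShards) {
        List<Set<String>> shards = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            shards.add(new HashSet<>());
        }
        for (String key : docGroupKeys) {
            shards.get(Math.floorMod(key.hashCode(), numShards)).add(key);
        }
        return shards;
    }

    public static void main(String[] args) {
        List<Set<String>> arbitrary = new ArrayList<>();
        arbitrary.add(new HashSet<>(Arrays.asList("a", "b")));
        arbitrary.add(new HashSet<>(Arrays.asList("b", "c")));
        System.out.println("summed: " + summedNGroups(arbitrary)
                + ", exact: " + exactNGroups(arbitrary));

        List<Set<String>> routed = routeByGroupKey(Arrays.asList("a", "b", "c", "b"), 2);
        System.out.println("summed: " + summedNGroups(routed)
                + ", exact: " + exactNGroups(routed));
    }
}
```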





On Wed, Jul 31, 2013 at 8:02 PM, Ali, Saqib docbook@gmail.com wrote:

 Hello Paul,

 Can you please explain what you mean by:
 To get the exact number of groups, you need to shard along your grouping
 field

 Thanks! :)


 On Wed, Jul 31, 2013 at 3:08 AM, Paul Masurel paul.masu...@gmail.com
 wrote:

  Do you mean you get different results with group=true?
  numFound is supposed to return the number of ungrouped hits.
 
  To get the number of groups, you are expected to set
  group.ngroups=true.
  Even then, the result will only give you an upper bound
  in a distributed environment.
  To get the exact number of groups, you need to shard along
  your grouping field.
 
  If you have many groups, you may also experience a huge performance
  hit, as the current implementation has been heavily optimized for a low
  number of groups (e.g. e-commerce categories).
 
  Paul
 
 
 
  On Wed, Jul 31, 2013 at 1:59 AM, Ali, Saqib docbook@gmail.com
 wrote:
 
   Hello all,
  
   Is anyone experiencing issues with the numFound when using group=true
 in
   SolrCloud 4.4?
  
   Sometimes the results are off for us.
  
   I will post more details shortly.
  
   Thanks.
  
 
 
 
 



