Hi Sumit, No, it doesn't matter whether the values in an IN clause are sorted or not - we'll sort them internally to ensure they are. We've run perf tests in the past where we had 250K values in an IN clause over a 1B row data set, and performance was still good. Take a look at this blog for more detail: http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html
Thanks, James On Mon, Oct 5, 2015 at 2:08 AM, James Heather <[email protected]> wrote: > I'll leave someone else to comment on the Phoenix specifics. > > I recall from some experiments on MySQL that if you have a massive load of > IDs to pass, it's quicker if you split them into batches of some reasonable > (but still large) size, and that for this, you would want to sort them > first. I don't think it made any difference whether the IDs were sorted > within an individual SQL statement, but you want to split into batches that > cover disjoint ranges, so the easiest is to sort the whole lot first and > then split. > > This might be MySQL-specific, though. I think each query was being turned > into a range scan, from the lowest ID in the IN clause to the highest, > which was why it was useful to get the ranges disjoint and not too huge. > > James > > > On 05/10/15 09:49, Sumit Nigam wrote: > > Hi, > > Would it make any difference if I were to pass non-sorted IDs (secondary > indexed) to a huge IN clause? I assume that skip scan optimization would > work in either case. > > Also, can any one let me know if there is some limit to beyond how many > such IDs in a large IN clause do I get into diminishing returns? Or is it > plainly dependent on specific workloads and memory of region servers? > > Thanks, > Sumit > > >
