[GitHub] gianm commented on issue #6066: Sorting rows when rollup is disabled

2018-08-02 Thread GitBox
gianm commented on issue #6066: Sorting rows when rollup is disabled
URL: 
https://github.com/apache/incubator-druid/issues/6066#issuecomment-410052884
 
 
   > I'm guessing Druid does the rollup by sorting the dimensions in the order 
in the ingestion spec, which would indicate that dimension ordering would 
matter. Would ordering dims based on cardinality help further? I'm guessing 
low->high card would be better than high->low, but idk enough about LZ(4?) 
compression to have a good intuition on this.
   
   Yep it does sort in the order that you provide them in the ingestion spec. 
LZ4 works by finding and eliminating redundancies in a sliding window, so it 
works best if you have a lot of local redundancy. CONCISE / Roaring also work 
better in that case (they can encode runs). And selecting values out of columns 
is faster if you are selecting rows that are close together.
   
   If you mostly care about size, I think you should do best starting with a 
lower cardinality column, especially one that a lot of other columns will have 
some kind of dependency on (like if you have a 'country' column and each 
region, city, user, etc will generally just be in one country). That way you're 
maximizing compression for as many columns as possible. If you care about 
retrieval speed too then it helps to think about locality: if you have a column 
you often filter on, you want it first/early in the sort order. Sometimes a 
column satisfies both of these, and in that case life is good (e.g. "tenant_id" 
for a multi-tenant-one-table design).
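   
   As a rough illustration of the locality argument (this is not Druid code; 
the column names and the toy data are made up for the sketch), counting runs of 
equal adjacent values per column shows why leading with a low-cardinality 
column that other columns depend on tends to help all of them:

```python
from itertools import groupby


def count_runs(values):
    """Count maximal runs of equal adjacent values (fewer runs = more local redundancy)."""
    return sum(1 for _ in groupby(values))


def runs_per_column(rows, sort_columns):
    """Sort rows by the given column order, then count runs in every column."""
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in sort_columns))
    return {c: count_runs([r[c] for r in ordered]) for c in rows[0]}


# Hypothetical data: every user lives in one city, every city in one country.
rows = []
for event in range(3):               # 3 events per user, arriving interleaved
    for user in range(24):
        city = user % 6              # 6 cities
        country = city % 2           # 2 countries
        rows.append({
            "country": f"c{country}",
            "city": f"city{city}",
            "user": f"u{user:02d}",
        })

low_first = runs_per_column(rows, ["country", "city", "user"])
high_first = runs_per_column(rows, ["user", "city", "country"])
print(low_first)
print(high_first)
```

   With this toy data, sorting on country first leaves 2 country runs and 6 
city runs, while sorting on user first fragments every column into 24 runs.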
   
   In the case I mentioned (where we saw dramatically better compression after 
sorting) it was an e-commerce dataset, and we sorted rows first on product 
brand, then on product id.
   
   Btw, this line of thinking is most effective if your queryGranularity is 
coarse, since Druid always sorts time _first_ and then your other dimensions.
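   
   A minimal sketch of why granularity matters, assuming the sort key is the 
timestamp truncated to queryGranularity followed by the dimensions (the `ts` 
field, the helper names, and the data are hypothetical, and this is not how 
Druid implements it internally):

```python
from itertools import groupby


def truncate(ts_millis, granularity_millis):
    """Floor a millisecond timestamp to the start of its granularity bucket."""
    return ts_millis - ts_millis % granularity_millis


def sort_rows(rows, granularity_millis, dims):
    """Sort on truncated time first, then on the dimensions in the given order."""
    return sorted(
        rows,
        key=lambda r: (truncate(r["ts"], granularity_millis), *(r[d] for d in dims)),
    )


def count_runs(values):
    """Count maximal runs of equal adjacent values."""
    return sum(1 for _ in groupby(values))


MINUTE = 60_000
HOUR = 3_600_000

# Hypothetical data: one row per minute, alternating between two tenants.
rows = [{"ts": i * MINUTE, "tenant_id": "a" if i % 2 == 0 else "b"} for i in range(8)]

by_minute = sort_rows(rows, MINUTE, ["tenant_id"])
by_hour = sort_rows(rows, HOUR, ["tenant_id"])

runs_minute = count_runs([r["tenant_id"] for r in by_minute])
runs_hour = count_runs([r["tenant_id"] for r in by_hour])
print(runs_minute, runs_hour)
```

   With minute buckets every row has its own time value, so the tenant_id 
ordering never gets a chance to apply; with hour buckets all rows share one 
time value and tenant_id groups cleanly.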


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[GitHub] gianm commented on issue #6066: Sorting rows when rollup is disabled

2018-08-02 Thread GitBox
gianm commented on issue #6066: Sorting rows when rollup is disabled
URL: 
https://github.com/apache/incubator-druid/issues/6066#issuecomment-410039394
 
 
   IMO the speed hit on ingestion is acceptable. It is small, and performance 
is not really worse than the rollup case. Adding sorting will incur a 
performance cost no matter what: either it has to happen continuously or it has 
to happen in one big shot before persisting.
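   
   The two options can be sketched generically as follows. This is only an 
illustration of the tradeoff, not the IncrementalIndex implementation; the 
function names and the (brand, product id) rows are made up:

```python
import bisect


def sort_continuously(incoming):
    """Keep the buffer sorted at all times: pay a little on every insert."""
    ordered = []
    for row in incoming:
        bisect.insort(ordered, row)
    return ordered


def sort_in_one_shot(incoming):
    """Append in arrival order and pay for a single sort right before persisting."""
    buffered = list(incoming)
    buffered.sort()
    return buffered


# Hypothetical (brand, product_id) rows in arrival order.
incoming = [("brand_b", 7), ("brand_a", 3), ("brand_b", 2), ("brand_a", 9)]

continuous = sort_continuously(incoming)
one_shot = sort_in_one_shot(incoming)
print(continuous == one_shot)
```

   Both produce the same row order at persist time; they differ only in when 
the sorting cost is paid.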
   
   Any idea where the speed hit on groupBys comes from? Seems strange that 
other query types wouldn't care but groupBy does. At any rate, I am not super 
worried about it, since in most clusters QueryableIndex storage/perf is a much 
more serious issue than IncrementalIndex query perf. If we end up incurring a 
small speed hit on IncrementalIndex queries in order to get better compression 
on QueryableIndexes, then that is a good tradeoff.

