gianm commented on issue #6066: Sorting rows when rollup is disabled
URL:
https://github.com/apache/incubator-druid/issues/6066#issuecomment-410052884
> I'm guessing Druid does the rollup by sorting the dimensions in the order
> in the ingestion spec, which would indicate that dimension ordering would
> matter. Would ordering dims based on cardinality help further? I'm guessing
> low->high card would be better than high->low, but idk enough about LZ(4?)
> compression to have a good intuition on this.
Yep, it does sort dimensions in the order that you provide them in the
ingestion spec.
LZ4 works by finding and eliminating redundancies in a sliding window, so it
works best if you have a lot of local redundancy. CONCISE / Roaring bitmaps also
work better in that case (they can encode runs). And selecting values out of
columns is faster if you are selecting rows that are close together.
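
To make the local-redundancy point concrete, here is a minimal sketch that
compresses the same made-up column twice: once in arrival order and once
sorted. It is only an illustration under stated assumptions: `zlib` (DEFLATE)
stands in for LZ4 because it ships with Python and also removes repeats found
in a sliding window, and the `brand_NN` values and the `compressed_size` helper
are hypothetical, not anything from Druid.

```python
import random
import zlib

def compressed_size(values):
    """Bytes needed to store the column after DEFLATE compression (LZ4 stand-in)."""
    return len(zlib.compress("\n".join(values).encode("utf-8")))

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical column: 20,000 rows drawn from 50 distinct values.
column = [f"brand_{rng.randrange(50):02d}" for _ in range(20_000)]

shuffled_size = compressed_size(column)        # values in arrival order
sorted_size = compressed_size(sorted(column))  # equal values sit next to each other

print("arrival order:", shuffled_size, "bytes")
print("sorted:       ", sorted_size, "bytes")
```

The sorted copy holds exactly the same values, but they form 50 long runs, so
the compressor sees far more nearby repetition.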
If you mostly care about size, I think you should do best starting with a
lower cardinality column, especially one that a lot of other columns will have
some kind of dependency on (like if you have a 'country' column and each
region, city, user, etc. will generally just be in one country). That way you're
maximizing compression for as many columns as possible. If you care about
retrieval speed too, then it helps to think about locality: if you have a column
you often filter on, you want it first/early in the sort order. Sometimes a
column satisfies both of these, and in that case life is good (e.g. "tenant_id"
for a multi-tenant-one-table design).
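
The dependency argument can be sketched the same way. The snippet below builds
a made-up country -> city -> user hierarchy, sorts the rows low->high and
high->low by cardinality, and reports the compressed size of each column
separately. Everything here is an assumption for illustration: the data is
synthetic, `column_sizes` is a hypothetical helper, and `zlib` again stands in
for LZ4.

```python
import random
import zlib

def column_sizes(rows, names):
    """Compress each column separately and return {column name: compressed bytes}."""
    sizes = {}
    for i, name in enumerate(names):
        data = "\n".join(row[i] for row in rows).encode("utf-8")
        sizes[name] = len(zlib.compress(data))
    return sizes

rng = random.Random(0)  # fixed seed so the run is repeatable
names = ("country", "city", "user")

# Hypothetical hierarchy: each city belongs to one country, each user to one city.
city_country = {f"city_{c:03d}": f"country_{rng.randrange(5)}" for c in range(200)}
cities = sorted(city_country)
user_city = {f"user_{u:04d}": rng.choice(cities) for u in range(5000)}
users = sorted(user_city)

rows = []
for _ in range(50_000):
    user = rng.choice(users)
    city = user_city[user]
    rows.append((city_country[city], city, user))

# Sort on (country, city, user) versus (user, city, country).
low_to_high = column_sizes(sorted(rows), names)
high_to_low = column_sizes(sorted(rows, key=lambda r: (r[2], r[1], r[0])), names)

for name in names:
    print(name, "low->high:", low_to_high[name], "high->low:", high_to_low[name])
```

With country first, the country and city columns collapse into a handful of
long runs. With user first, those two columns are broken into one short run per
user. How the leading high-cardinality column itself fares depends on the data,
so it is worth comparing the per-column numbers rather than assuming a total.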
In the case I mentioned (where we saw dramatically better compression after
sorting) it was an e-commerce dataset, and we sorted rows first on product
brand, then on product id.
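Btw, this line of thinking is most effective if your queryGranularity is
coarse, since Druid always sorts time _first_ and then your other dimensions.
With a fine queryGranularity, few rows share the same truncated timestamp, so
there is little left for the dimension order to sort.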
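
A rough sketch of why the granularity matters: the snippet truncates made-up
event timestamps to a bucket size, sorts by (time bucket, brand), and
compresses the resulting brand column. This is a simplified model of "time
first, then dimensions", not Druid's actual code path; `brand_column_size` and
the event data are hypothetical, and `zlib` stands in for LZ4.

```python
import random
import zlib

def brand_column_size(events, granularity_seconds):
    """Truncate timestamps to the granularity, sort by (time bucket, brand),
    and return the compressed size of the resulting brand column."""
    rows = sorted((ts - ts % granularity_seconds, brand) for ts, brand in events)
    data = "\n".join(brand for _, brand in rows).encode("utf-8")
    return len(zlib.compress(data))

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical events: 20,000 rows spread over one day, 50 distinct brands.
events = [
    (rng.randrange(86_400), f"brand_{rng.randrange(50):02d}")
    for _ in range(20_000)
]

fine_size = brand_column_size(events, 1)         # one-second buckets
coarse_size = brand_column_size(events, 86_400)  # a single one-day bucket

print("second granularity:", fine_size, "bytes")
print("day granularity:   ", coarse_size, "bytes")
```

With one-second buckets almost every row has its own timestamp, so the brand
order is effectively random. With a single day bucket the brand sort applies to
the whole set and the column turns into long runs.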