Data modeling for segmenting a huge data set: precomputing vs real time computations

eugene miretsky Tue, 21 Aug 2018 07:37:26 -0700

Hello,

We have a very big data set (200M+ users), for each user we store their
activity (views, sales, etc.) and we need to be able to segment the users
based on that data.


One (very simplified) use case looks something like:
- Data: customer_id, date, category, sub_category, action
- Query: All customers that had X views in last Y days in sub_category Z.

The problem becomes huge very fast since we have hundereds of categories
and actions and several aggreagation functions (total views, min purchase
amount, etc.)

There are 2 ways to go about it
1) Pre-compute everything (memory intensive):
-  Very wide rows for each user: sub_categories x features x aggregation
functions  (we don't need every single permutation, but we still end up
with 1000+ columns).
- The tables are very sparse since most of the columns are null for most
users
2) Store the row data and perform the grouping in real time (compute
intensive)

Currently we are trying a middle version
-  Precompute all the features every day: (customer_id, day,
man_fashion_views, electronics_views, electronics_purchase_amount)
-  Filter by date and perform the aggregation in real time:  (SELECT
customer_id FROM testTable where date > '2018-08-02' GROUP BY customer_id
HAVING SUM(man_fashion_views) > 2 AND MAX(electronics_purchase_amount) <
100) )

Currently we are trying to understand the internals of Ignite a bit better
in order to understand how to model and optimize the above use case
1) How are sparse rows stored? Do nulls take the same amount of space as
actual values?
2) What is the cost of bringing data from off-heap memory to the heap?
3) How are filtering, group_by and aggregates performed?
 -- Does all the data get moved from off-heap to heap on every query?
 -- Is any filtering performed off-heap?
 -- If we need to calculate  aggregate A over a 3 day period and aggregate
B over 2 weeks period. Should we do it in 2 queries and then join the
results or in 1 query?
 -- Are indexes used during group_by? (say we are grouping by category,
would an index on category column help significantly?)
 -- Are there any optimization tricks to speed group_by?
4) One big table vs many smaller tables (all with customer_id as
affinityKey)
-- What is the cost of joining across colocated tables?

We are going to benchmark most of these, but we just started testing Ignite
and don't have a good intuition about what is likely to work.

Would apprecaite any tips you can provide.

Cheers,
Eugene



Cheers,
Eugene

Data modeling for segmenting a huge data set: precomputing vs real time computations

Reply via email to