Hello, We have a very big data set (200M+ users), for each user we store their activity (views, sales, etc.) and we need to be able to segment the users based on that data.
One (very simplified) use case looks something like: - Data: customer_id, date, category, sub_category, action - Query: All customers that had X views in last Y days in sub_category Z. The problem becomes huge very fast since we have hundereds of categories and actions and several aggreagation functions (total views, min purchase amount, etc.) There are 2 ways to go about it 1) Pre-compute everything (memory intensive): - Very wide rows for each user: sub_categories x features x aggregation functions (we don't need every single permutation, but we still end up with 1000+ columns). - The tables are very sparse since most of the columns are null for most users 2) Store the row data and perform the grouping in real time (compute intensive) Currently we are trying a middle version - Precompute all the features every day: (customer_id, day, man_fashion_views, electronics_views, electronics_purchase_amount) - Filter by date and perform the aggregation in real time: (SELECT customer_id FROM testTable where date > '2018-08-02' GROUP BY customer_id HAVING SUM(man_fashion_views) > 2 AND MAX(electronics_purchase_amount) < 100) ) Currently we are trying to understand the internals of Ignite a bit better in order to understand how to model and optimize the above use case 1) How are sparse rows stored? Do nulls take the same amount of space as actual values? 2) What is the cost of bringing data from off-heap memory to the heap? 3) How are filtering, group_by and aggregates performed? -- Does all the data get moved from off-heap to heap on every query? -- Is any filtering performed off-heap? -- If we need to calculate aggregate A over a 3 day period and aggregate B over 2 weeks period. Should we do it in 2 queries and then join the results or in 1 query? -- Are indexes used during group_by? (say we are grouping by category, would an index on category column help significantly?) -- Are there any optimization tricks to speed group_by? 4) One big table vs many smaller tables (all with customer_id as affinityKey) -- What is the cost of joining across colocated tables? We are going to benchmark most of these, but we just started testing Ignite and don't have a good intuition about what is likely to work. Would apprecaite any tips you can provide. Cheers, Eugene Cheers, Eugene
