Re: Time-based frequency table at scale
Hi, did you try exploding the arrays, then doing the aggregation/count, and at the end applying a UDF to add the 0 values? In my experience, working on arrays is usually a bad idea.

sakag writes:

> Hi all,
>
> We have a rather interesting use case, and are struggling to come up with
> an approach that scales. Reaching out to seek your expert opinion/feedback
> and tips.
>
> What we are trying to do is to find the count of numerical ids over a
> sliding time window, where each of our data records has a timestamp and a
> set of numerical ids in the format below.
>
> timestamp | ids
> 1         | [1, 2, 3, 8]
> 1         | [1, 2]
> 2         | [1, 2, 3, 4]
> 2         | [1, 10]
>
> What we are looking to get as output is:
>
> timestamp | id_count_map
> 1         | {1: 2, 2: 2, 3: 1, 4: 0, 5: 0, 6: 0, 8: 1}
> 2         | {1: 2, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 1}
>
> This gives us the frequency of occurrence of these ids over time periods.
> Please note that the expected output is in a dense format.
>
> However, we are running into scale issues with data that has these
> characteristics:
>
> - 500 million records
> - Total ~100 GB
> - Each record can have 500 elements in the ids column
> - Max id value (length of id_count_map) is 750K
>
> We have tried the approaches below to achieve this:
>
> 1) Expanding ids to a dense, frequency-based vector and then doing a
>    row-wise sum over a Window partitioned by timestamp
> 2) Converting ids into a SparseVector and computing the L1 norm (using
>    Summarizer) over a Window partitioned by timestamp
> 3) GroupBy/aggregating ids by timestamp, converting to a sparse,
>    frequency-based vector using collections.Counter, and expanding to a
>    dense format
> 4) GroupBy/aggregating ids by timestamp, converting to a sparse,
>    frequency-based vector using CountVectorizer, and then expanding to a
>    dense format
>
> Any other approaches we could try?
>
> Thanks!
> Sakshi

--
nicolas paris
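A minimal PySpark sketch of the explode / groupBy-count / densify idea suggested above, run against the toy data from the original post. The column names, the MAX_ID constant, and the densify helper are illustrative assumptions, not code from the thread:

# Sketch: explode the arrays, count per (timestamp, id), then add the 0 values.
# MAX_ID and densify() are assumed names for illustration only.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType, MapType

spark = SparkSession.builder.getOrCreate()

# Toy version of the input described in the original post.
df = spark.createDataFrame(
    [(1, [1, 2, 3, 8]), (1, [1, 2]), (2, [1, 2, 3, 4]), (2, [1, 10])],
    ["timestamp", "ids"])

MAX_ID = 10  # in the real data this would be ~750K

# 1) Explode so each (timestamp, id) pair becomes its own row.
exploded = df.select("timestamp", F.explode("ids").alias("id"))

# 2) Count occurrences of each id per timestamp (still sparse).
counts = exploded.groupBy("timestamp", "id").count()

# 3) Collapse the sparse counts into one map per timestamp.
sparse = (counts
          .groupBy("timestamp")
          .agg(F.map_from_entries(
              F.collect_list(F.struct("id", "count"))).alias("sparse_map")))

# 4) UDF that fills in the missing ids with 0 to produce the dense map.
@F.udf(MapType(IntegerType(), IntegerType()))
def densify(sparse_map):
    return {i: sparse_map.get(i, 0) for i in range(1, MAX_ID + 1)}

result = sparse.select("timestamp", densify("sparse_map").alias("id_count_map"))
result.show(truncate=False)

With the real data the densify step is likely the expensive part, since each output row would carry a ~750K-entry map; everything up to the sparse per-timestamp map stays proportional to the ids that actually occur.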
Re: Time-based frequency table at scale
An interesting puzzle indeed.

What is your measure of "that scales"? Does not fail, does not spill, does not need a huge amount of memory/disk, is O(N), processes X records per second and core?

Enrico

On 11.03.20 at 16:59, sakag wrote:
> [...]