Re: Time-based frequency table at scale

Enrico Minack Wed, 11 Mar 2020 11:52:27 -0700

An interesting puzzle indeed.

What is your measure of "that scales"? Does not fail, does not spill,does not need a huge amount of memory / disk, is O(N), processes Xrecords per second and core?


Enrico

Am 11.03.20 um 16:59 schrieb sakag:

Hi all,

We have a rather interesting use case, and are struggling to come up with an

approach that scales. Reaching out to seek your expert opinion/feedback and
tips.

What we are trying to do is to find the count of numerical ids over a

sliding time window where each of our data records has a timestamp and a set
of numerical ids in the below format.

timestamp | ids

1  [1,2,3,8]
1  [1,2]
2  [1,2,3,4]
2  [1, 10]

What we are looking to get as output is:timestamp | id_count_map

1 | {1: 2, 2: 2, 3: 1, 4:0, 5:0, 6:0, 8:1}
2 | {1: 2, 2:1, 3: 1, 4: 1, 5:0, 6:0, 7:0, 8:0, 9:0, 10:1}

This gives us the frequency of occurrence of these ids over time periods.

Please note that the output expected is in a dense format.

However, we are running into scale issues with the data that has these

characteristics.

- 500 million records - Total ~100 GB

- Each record can have 500 elements in the ids column
- Max id value (length of id_count_map) is 750K

We have tried the below approaches to achieve this

1) Expanding ids to a dense, frequency-based vector and then doing a
row-wise sum over a Window partitioned by timestamp
2) Converting ids into a SparseVector and computing the L1 norm (using
Summarizer) over a Window partitioned by timestamp
3) GroupBy/aggregating ids by timestamp, converting to a sparse,
frequency-based vector using collections.Counter, and expanding to a dense
format
4) GroupBy/aggregating ids by timestamp, converting to a sparse,
frequency-based vector using CountVectorizer, and then expanding to a dense
format

Any other approaches we could try?Thanks!

Sakshi



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Time-based frequency table at scale

Reply via email to