Hello everyone!

I asked the following question and am hoping to get some suggestions on
whether what I want is doable:

https://stackoverflow.com/questions/53746046/how-can-i-implement-zipwithindex-like-spark-in-apache-beam/53747612#53747612

If I can get the `PCollection` id and the number of (contiguous) lines in
each `PCollection`, then I can first compute the row order within each
partition/`PCollection` and then do a prefix sum to compute the offset of
each partition. This is doable in MPI or OpenMP, since I can get the
id/rank of each processor/thread.
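
Purely as illustration (not actual Beam API), here is a minimal sketch of
that offset computation, assuming we somehow already know the line count
of each partition in partition-id order; the class and method names are
made up for this example:

```java
import java.util.Arrays;

public class PrefixSumOffsets {
    // counts[i] = number of lines in partition i.
    // Returns offsets[i] = number of lines in all partitions before i
    // (an exclusive prefix sum).
    static long[] exclusivePrefixSum(long[] counts) {
        long[] offsets = new long[counts.length];
        long running = 0;
        for (int i = 0; i < counts.length; i++) {
            offsets[i] = running; // global offset of partition i
            running += counts[i];
        }
        return offsets;
    }

    public static void main(String[] args) {
        long[] counts = {4, 7, 3};                    // three partitions
        long[] offsets = exclusivePrefixSum(counts);  // -> {0, 4, 11}
        // The global index of local row r in partition p is offsets[p] + r.
        System.out.println(Arrays.toString(offsets));
    }
}
```

This is exactly the MPI-style pattern: once every partition knows its rank
and size, one prefix sum gives every row a stable global index.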

Anton pointed out that the current design wants to allow dynamic
scheduling/allocation at run time. My approach only works for static
allocation at compile time with a fixed number of hardware resources.

There could be another way to look at this problem. The file can also sit
in HDFS or Google Cloud Storage before being processed in Beam. So we might
reduce the problem to uploading and splitting such a big file into chunks
while preserving the row order within the file. In that case, by the time
Beam processes the chunks of this file, there is no row-order-preserving
work left to do.
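
To make the pre-splitting idea concrete, here is a local sketch (again not
Beam code; the input path and file-naming scheme are assumptions) that
splits a text file into fixed-size chunks whose names encode the chunk id
and starting row, so the global index of any line can be recovered later
without ordering work:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ChunkSplitter {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("big-file.txt"); // hypothetical input file
        int linesPerChunk = 100_000;

        long row = 0;   // global row counter across the whole file
        int chunk = 0;  // chunk id, increases in file order
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            String line;
            PrintWriter out = null;
            while ((line = reader.readLine()) != null) {
                if (row % linesPerChunk == 0) {
                    if (out != null) out.close();
                    // Encode order and starting row in the name,
                    // e.g. chunk-00003-start-300000.txt
                    String name = String.format(
                        "chunk-%05d-start-%d.txt", chunk++, row);
                    out = new PrintWriter(
                        Files.newBufferedWriter(Paths.get(name)));
                }
                out.println(line);
                row++;
            }
            if (out != null) out.close();
        }
    }
}
```

In practice the same scheme would run against HDFS or GCS rather than the
local filesystem, but the key point is the same: the chunk name carries the
offset, so Beam can treat each chunk independently.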

Best,
Chak-Pong
