Hello,

I have a use case that seems relatively simple to solve using Spark, but I
can't seem to figure out a reliable way to do it.

I have a dataset which contains time series data for various users. All I'm
looking to do is:

   - partition this dataset by user ID,
   - sort each user's time series data by timestamp (each user's data
   should by then be contained within a single partition), and
   - write each partition to a single CSV file, so that in the end I have
   one CSV file per user ID (see the layout sketch after this list).
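
For reference, this is the output layout I'm expecting from partitionBy
(the directory-per-value structure is what partitionBy produces; the user
IDs and part file names below are just illustrative):

output_path/
    userId=user_a/part-00000-xxxx.csv
    userId=user_b/part-00001-xxxx.csv
    ...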

I tried using the following code snippet, but ended up getting surprising
results. I do end up with one CSV file per user ID, and most users' time
series data was indeed sorted, but the data of a small fraction of users
was unsorted.

# repr(ds) = DataFrame[userId: string, timestamp: string,
#                      c1: float, c2: float, c3: float, ...]
ds = load_dataset(user_dataset_path)
(ds.repartition("userId")
   .sortWithinPartitions("timestamp")
   .write
   .partitionBy("userId")
   .option("header", "true")
   .csv(output_path))

I'm unclear as to why this could happen, and I'm not entirely sure how to
do this reliably. I'm also not sure whether this is a bug in Spark.

I'm using Spark 2.0.2 with Python 2.7.12. Any advice would be very much
appreciated!

--
Regards,


Ivan Gozali
