For those interested, after digging further, I was able to consistently
reproduce the issue with a synthetic dataset. My findings are documented
here:

https://gist.github.com/igozali/d327a85646abe7ab10c2ae479bed431f
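For context, the end state the pipeline below is trying to reach (one timestamp-sorted CSV per user) can be sketched outside Spark in plain Python. This is only an illustration of the desired output, not the Spark job itself; the column names are hypothetical and simplified from the schema in the original message.

```python
# Hypothetical illustration of the intended result: given rows of
# (userId, timestamp, c1), produce one timestamp-sorted CSV per user.
import csv
import io
from collections import defaultdict

def write_per_user_csvs(rows):
    """Group rows by userId, sort each group by timestamp, and render
    one CSV (as a string) per user. Returns {userId: csv_text}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["userId"]].append(row)
    out = {}
    for user, user_rows in groups.items():
        # Sort each user's time series before writing, which is what
        # sortWithinPartitions("timestamp") is expected to guarantee.
        user_rows.sort(key=lambda r: r["timestamp"])
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=["userId", "timestamp", "c1"])
        writer.writeheader()
        writer.writerows(user_rows)
        out[user] = buf.getvalue()
    return out

rows = [
    {"userId": "u1", "timestamp": "2017-01-02", "c1": "0.2"},
    {"userId": "u2", "timestamp": "2017-01-01", "c1": "0.9"},
    {"userId": "u1", "timestamp": "2017-01-01", "c1": "0.1"},
]
csvs = write_per_user_csvs(rows)
```

The surprising behavior described below is that Spark produces this grouping correctly, but for a small fraction of users the within-file ordering is lost.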


--
Regards,


Ivan Gozali
Lecida
Email: i...@lecida.com

On Wed, Jan 18, 2017 at 12:14 PM, Ivan Gozali <i...@lecida.com> wrote:

> Hello,
>
> I have a use case that seems relatively simple to solve with Spark, but
> I can't seem to figure out a reliable way to do it.
>
> I have a dataset which contains time series data for various users. All
> I'm looking to do is:
>
>    - partition this dataset by user ID
>    - sort the time series data for each user which by then should
>    supposedly be contained within individual partitions,
>    - write each partition to a single CSV file. In the end I'd like to
>    end up with 1 CSV file per user ID.
>
> I tried the following code snippet, but got surprising results. I do end
> up with one CSV file per user ID, and most users' time series data are
> indeed sorted, but the data for a small fraction of users are unsorted.
>
> # repr(ds) = DataFrame[userId: string, timestamp: string,
> #            c1: float, c2: float, c3: float, ...]
> ds = load_dataset(user_dataset_path)
> (ds.repartition("userId")
>    .sortWithinPartitions("timestamp")
>    .write.partitionBy("userId")
>    .option("header", "true")
>    .csv(output_path))
>
> I'm unclear why this happens, and I'm not sure how to achieve this
> reliably. I'm also not sure whether this is a bug in Spark.
>
> I'm using Spark 2.0.2 with Python 2.7.12. Any advice would be very much
> appreciated!
>
> --
> Regards,
>
>
> Ivan Gozali
>
