Re: saveAsNewAPIHadoopDataset must not enable speculation for parquet file?

2018-04-26 Thread cane
Thanks Steve! I will study the links you mentioned.

Re: saveAsNewAPIHadoopDataset must not enable speculation for parquet file?

2018-04-26 Thread Steve Loughran
Sorry, I had not noticed this follow-up; I have been busy with other issues. On 3 Apr 2018, at 11:19, cane <zhoukang199...@gmail.com> wrote: Now, if we use saveAsNewAPIHadoopDataset with speculation enabled, it may cause data loss. I checked the comment on this API: We should make sure our tasks are idemp

Re: saveAsNewAPIHadoopDataset must not enable speculation for parquet file?

2018-04-07 Thread 周康
I have observed the following: if the job commit runs on the driver and the task commit runs on the executor, then with speculation enabled it may cause data loss. The job commit calls listStatus, while the task commit deletes the output file if it already exists and then renames its own output to the final path. When listStatus is called after the delete and before re
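The race described in this message can be sketched with a small local-filesystem simulation. This is only an illustration under stated assumptions: the excerpt is cut off, so the interleaving shown (a listing that runs after the delete but before the rename) is inferred from the visible text, and the names `commit_task`, `simulate`, and `part-00000` are hypothetical, not Hadoop or Spark APIs.

```python
import os
import tempfile


def commit_task(attempt_file, final_file, between_delete_and_rename=None):
    """Toy task commit: delete an existing final file, then rename the attempt into place.

    The optional callback runs in the gap between the delete and the rename,
    standing in for a job commit that lists the output directory at that moment.
    """
    if os.path.exists(final_file):
        os.remove(final_file)
    if between_delete_and_rename is not None:
        between_delete_and_rename()
    os.rename(attempt_file, final_file)


def simulate():
    with tempfile.TemporaryDirectory() as root:
        out_dir = os.path.join(root, "output")
        tmp_dir = os.path.join(root, "attempts")
        os.makedirs(out_dir)
        os.makedirs(tmp_dir)
        final_file = os.path.join(out_dir, "part-00000")  # hypothetical name

        # The original attempt commits its output first.
        first = os.path.join(tmp_dir, "attempt_0")
        with open(first, "w") as f:
            f.write("rows")
        commit_task(first, final_file)

        # A speculative attempt for the same partition commits afterwards.
        second = os.path.join(tmp_dir, "attempt_1")
        with open(second, "w") as f:
            f.write("rows")

        seen = []

        def job_commit_listing():
            # Assumed timing: the listing lands inside the delete/rename gap.
            seen.extend(sorted(os.listdir(out_dir)))

        commit_task(second, final_file, between_delete_and_rename=job_commit_listing)
        return seen, sorted(os.listdir(out_dir))


seen_by_job_commit, present_afterwards = simulate()
print("listed during job commit:", seen_by_job_commit)
print("present after task commit:", present_afterwards)
```

In this sketch the listing comes back empty even though the file exists again once the rename completes. A job commit that moves or records only what it listed would therefore skip that partition's data.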

Re: saveAsNewAPIHadoopDataset must not enable speculation for parquet file?

2018-04-03 Thread Steve Loughran
> On 3 Apr 2018, at 11:19, cane wrote: > > Now, if we use saveAsNewAPIHadoopDataset with speculation enabled, it may cause > data loss. > I checked the comment on this API: > > We should make sure our tasks are idempotent when speculation is enabled, > i.e. do > * not use output committer that w

saveAsNewAPIHadoopDataset must not enable speculation for parquet file?

2018-04-03 Thread cane
Now, if we use saveAsNewAPIHadoopDataset with speculation enabled, it may cause data loss. I checked the comment on this API: We should make sure our tasks are idempotent when speculation is enabled, i.e. do * not use output committer that writes data directly. * There is an example in https://