GitHub user zheh12 opened a pull request:

    https://github.com/apache/spark/pull/21286

    [SPARK-24194] HadoopFsRelation cannot overwrite a path that is also b…

    ## What changes were proposed in this pull request?
    
    When multiple tasks append to the same `HadoopFsRelation` at the same time, 
two kinds of errors occur: 
    
    1. A task will succeed, but its data will be wrong: more data than 
expected will appear
    2. Other tasks will fail with `java.io.FileNotFoundException: Failed to get 
file status skip_dir/_temporary/0`
    
    The main cause of this problem is that multiple jobs use the same 
`_temporary` directory.
    
    So the core idea of this PR is to create a separate temporary directory, 
named with the jobId, for each job in the `output` folder, so that conflicts 
can be avoided.
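    
    To illustrate the conflict, here is a minimal sketch on the local filesystem.
    It is not Hadoop's committer and not the code in this PR: the helper names and
    the `_temporary-<job_id>` naming are hypothetical and chosen only for the
    example. It runs two interleaved jobs, first with a shared staging directory
    and then with one staging directory per job.

```python
import shutil
import tempfile
from pathlib import Path


def shared_staging_dir(output: Path) -> Path:
    # Layout seen in the error message: every job stages under output/_temporary/0.
    return output / "_temporary" / "0"


def per_job_staging_dir(output: Path, job_id: str) -> Path:
    # Hypothetical layout: one staging directory per job, keyed by job_id.
    return output / f"_temporary-{job_id}" / "0"


def stage(staging: Path, name: str) -> Path:
    # A task writes its output file into the job's staging directory.
    staging.mkdir(parents=True, exist_ok=True)
    target = staging / name
    target.write_text(name)
    return target


def commit(output: Path, staging: Path) -> None:
    # Move every staged file into the output directory, then delete the
    # staging root (the parent of the "0" attempt directory).
    for staged in sorted(staging.iterdir()):
        shutil.move(str(staged), str(output / staged.name))
    shutil.rmtree(staging.parent)


def run_two_jobs(output: Path, dir_for_job) -> dict:
    # Interleave two jobs: both stage a file, job "a" commits, then job "b" commits.
    staging_a, staging_b = dir_for_job("a"), dir_for_job("b")
    stage(staging_a, "part-a")
    stage(staging_b, "part-b")
    commit(output, staging_a)
    files_after_a = sorted(p.name for p in output.iterdir() if p.is_file())
    try:
        commit(output, staging_b)
        b_failed = False
    except FileNotFoundError:
        # Job "a" already removed the staging directory that job "b" relies on.
        b_failed = True
    return {"after_a": files_after_a, "b_failed": b_failed}


with tempfile.TemporaryDirectory() as tmp:
    shared_out = Path(tmp) / "shared"
    shared_out.mkdir()
    shared = run_two_jobs(shared_out, lambda job_id: shared_staging_dir(shared_out))

    isolated_out = Path(tmp) / "isolated"
    isolated_out.mkdir()
    isolated = run_two_jobs(
        isolated_out, lambda job_id: per_job_staging_dir(isolated_out, job_id)
    )

print("shared:  ", shared)
print("isolated:", isolated)
```

    With the shared directory, job "a" also commits job "b"'s file (more data than
    expected) and job "b" then fails because its staging directory is gone. With
    per-job directories, each job commits only its own file.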
    
    ## How was this patch tested?
    
    I tested this manually. 
    However, I don't know how to write a unit test for this situation. Please help 
me.
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zheh12/spark SPARK-24238

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21286.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21286
    
----
commit b676a36af110b0b7d7dfc47ab292d09c441f6a0f
Author: yangz <zheh12@...>
Date:   2018-05-10T01:46:49Z

    [SPARK-24194] HadoopFsRelation cannot overwrite a path that is also being 
read from

----

