[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/15422 @srowen Since this is happening 'below' the user code (in the Hadoop RDD), is there a way to handle this? I agree that for a lot of use cases where it is critical to work off the

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/15422 @zsxwing The map task is run by

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread zsxwing
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15422 > For example, in MR you have the ability to even set the percentage of bad records you want to tolerate (we don't have that in Spark). I may be wrong. But in MR, I think bad records just

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15422 @mridulm for the scenario you're imagining, maybe the data is OK, sure. That doesn't mean it's true in all cases. Yeah, this is really to work around bad input, which you can to some degree do at

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/15422 @marmbrus +1 on logging; that is something that was probably missed here.

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/15422 @zsxwing You are right, NewHadoopRDD is not handling this case. It would probably be good to add exception handling there for when nextKeyValue throws an exception? Context is, for large
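The handling proposed in this comment can be sketched roughly as follows. This is a minimal illustration in Python, not Spark's or Hadoop's code: `tolerant_records`, `next_record`, and the `ignore_eof` flag are hypothetical names, and `EOFError` stands in for Java's `EOFException`. The wrapper logs the premature end of input (as suggested elsewhere in the thread) and either ends the split quietly or re-raises, depending on the flag.

```python
import logging
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")
log = logging.getLogger("reader_sketch")  # hypothetical logger name


def tolerant_records(
    next_record: Callable[[], Optional[T]],
    ignore_eof: bool,
) -> Iterator[T]:
    """Yield records from next_record() until it returns None.

    next_record is a hypothetical stand-in for a record reader's
    "fetch next" call. If it raises EOFError, the error is re-raised
    when ignore_eof is False; otherwise it is logged and the input is
    treated as finished, keeping the records already yielded.
    """
    while True:
        try:
            record = next_record()
        except EOFError as exc:
            if not ignore_eof:
                raise
            log.warning("Input ended unexpectedly; ending split early: %s", exc)
            return
        if record is None:  # clean end of input
            return
        yield record
```

Making the behavior a flag keeps both positions in the thread expressible: failing the task by default, or tolerating a truncated tail while still leaving a log entry.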

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread zsxwing
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15422 @mridulm This fix just makes HadoopRDD consistent with NewHadoopRDD and the current behavior of Spark SQL in 2.0. For 1.6, that's another story since Spark SQL uses HadoopRDD directly.

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread marmbrus
Github user marmbrus commented on the issue: https://github.com/apache/spark/pull/15422 I agree that the data that was already read is probably good. I also think that this is a pretty big behavior change where there are legitimate cases (e.g. tons of data and it is fine to miss

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/15422 @srowen The tuples already returned would have been valid; it is the subsequent block decompression that has failed. For example, in a 1 GB file, the last few bytes being missing (or corrupt) will cause
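The truncated-file scenario in this comment can be reproduced in miniature. The sketch below uses Python's standard `gzip` module rather than Hadoop's codecs, so it only illustrates the general failure mode; the helper names `compress_lines` and `read_lines_until_failure` are made up for this example. It drops the last few bytes of a gzip stream and reads records one at a time until the decompressor raises `EOFError` (Python's counterpart to `EOFException`).

```python
import gzip
import io


def compress_lines(lines):
    """Gzip-compress newline-terminated UTF-8 records into one byte string."""
    payload = "".join(line + "\n" for line in lines).encode("utf-8")
    return gzip.compress(payload)


def read_lines_until_failure(compressed):
    """Read records one by one; return (records_read, error).

    error is None when the stream ends cleanly, or the EOFError raised
    when the compressed input stops before its end-of-stream marker.
    """
    records = []
    error = None
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as stream:
            for raw in stream:
                records.append(raw.decode("utf-8").rstrip("\n"))
    except EOFError as exc:
        error = exc
    return records, error


if __name__ == "__main__":
    original = [f"record-{i}" for i in range(50_000)]
    blob = compress_lines(original)
    truncated = blob[:-16]  # drop the last few bytes of the file
    records, error = read_lines_until_failure(truncated)
    print(len(records), "records read before:", repr(error))
    print("prefix of original:", records == original[: len(records)])
```

In this setup the records returned before the error match the original input, which is the point being made about already-returned tuples. It covers truncation only; corruption in the middle of a stream can fail in other ways.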

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15422 If this happens it isn't clear that anything that was read is valid. It doesn't seem like something to ignore. Log, at least. I know people differ on this but I think continuing with partial and

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/15422 Why would a corrupt record cause EOFException to be thrown?

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15422 Merged build finished. Test FAILed.

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15422 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66698/ Test FAILed.

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15422 **[Test build #66698 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66698/consoleFull)** for PR 15422 at commit

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15422 **[Test build #66698 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66698/consoleFull)** for PR 15422 at commit

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread zsxwing
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15422 retest this please

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15422 Merged build finished. Test FAILed.

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15422 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66691/ Test FAILed.

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15422 **[Test build #66691 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66691/consoleFull)** for PR 15422 at commit

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15422 **[Test build #66691 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66691/consoleFull)** for PR 15422 at commit

[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread zsxwing
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15422 /cc @marmbrus