Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/15422
@srowen Since this is happening 'below' the user code (in the HadoopRDD),
is there a way for users to handle or work around this?
I agree that for a lot of use cases where it is critical to work off the
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/15422
@zsxwing The map task is run by
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/15422
> For example, in MR you have the ability to even set the percentage of bad
records you want to tolerate (we don't have that in Spark).
I may be wrong, but in MR, I think bad records just
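For context, the MapReduce feature referenced above is (to my understanding) the skip-bad-records mode of the old `mapred` API, configured through `org.apache.hadoop.mapred.SkipBadRecords`. A rough sketch; the thresholds below are illustrative, not recommendations, and should be checked against your Hadoop version:

```scala
import org.apache.hadoop.mapred.{JobConf, SkipBadRecords}

// Sketch only: enable MR's skipping mode on an old-API job.
val conf = new JobConf()
// Tolerate up to 1000 skipped records around each bad record.
SkipBadRecords.setMapperMaxSkipRecords(conf, 1000L)
// Begin skipping after 2 failed task attempts.
SkipBadRecords.setAttemptsToStartSkipping(conf, 2)
```

Note this skips records around a failure after retries, which is not quite a percentage-based tolerance.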
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/15422
@mridulm for the scenario you're imagining, maybe the data is OK, sure.
That doesn't mean it's true in all cases. Yeah, this is really to work around
bad input, which you can to some degree do at
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/15422
@marmbrus +1 on logging, that is definitely something which was probably
missed here.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/15422
@zsxwing You are right, NewHadoopRDD is not handling this case.
Probably would be good to add exception handling there when nextKeyValue
throws an exception?
Context is, for large
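A minimal sketch of the kind of guard being suggested here: wrap the record iterator so an EOF from the underlying reader ends the iteration rather than failing the task. This is illustrative only (the class name and structure are assumptions, not the actual Spark internals):

```scala
import java.io.EOFException

// Hypothetical wrapper: records read before the failure are still returned;
// an EOFException from the reader simply terminates the iteration.
class TolerantIterator[T](underlying: Iterator[T]) extends Iterator[T] {
  private var finished = false
  private var buffered: Option[T] = None

  // Pull one record ahead so EOF from the underlying reader is caught here.
  private def advance(): Unit = {
    if (!finished && buffered.isEmpty) {
      try {
        if (underlying.hasNext) buffered = Some(underlying.next())
        else finished = true
      } catch {
        case _: EOFException =>
          // Truncated or corrupt tail: stop iterating, keep what was read.
          finished = true
      }
    }
  }

  override def hasNext: Boolean = { advance(); buffered.isDefined }
  override def next(): T = {
    advance()
    val v = buffered.getOrElse(throw new NoSuchElementException("empty iterator"))
    buffered = None
    v
  }
}
```

Logging the swallowed exception (as suggested elsewhere in this thread) would belong in the catch block.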
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/15422
@mridulm This fix just makes HadoopRDD consistent with NewHadoopRDD and the
current behavior of Spark SQL in 2.0.
For 1.6, that's another story since Spark SQL uses HadoopRDD directly.
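For readers landing here later: the behavior under discussion appears to be governed by a Spark conf. Assuming the flag from this line of work is `spark.files.ignoreCorruptFiles` (and `spark.sql.files.ignoreCorruptFiles` on the SQL side), verify the exact name and default for your Spark version; usage would look like:

```scala
import org.apache.spark.SparkConf

// Sketch: opt in to skipping corrupt files, keeping contents read so far.
val conf = new SparkConf()
  .set("spark.files.ignoreCorruptFiles", "true")
```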
Github user marmbrus commented on the issue:
https://github.com/apache/spark/pull/15422
I agree that the data that was already read is probably good. I also think
that this is a pretty big behavior change where there are legitimate cases
(e.g. tons of data and it is fine to miss
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/15422
@srowen The tuples already returned would have been valid; it is the
subsequent block decompression which has failed. For example, in a 1GB file,
the last few bytes missing (or corrupt) will cause
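To illustrate the failure mode described above (plain JDK gzip, not Spark code): truncating a compressed stream lets every earlier byte decode cleanly, then surfaces an EOFException when the decompressor runs past the cut.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, EOFException}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Compress some well-formed data...
val buf = new ByteArrayOutputStream()
val gz = new GZIPOutputStream(buf)
gz.write(("record\n" * 1000).getBytes("UTF-8"))
gz.close()

// ...then chop off the tail, as a damaged or short file would.
val truncated = buf.toByteArray.dropRight(20)

val in = new GZIPInputStream(new ByteArrayInputStream(truncated))
val chunk = new Array[Byte](4096)
try {
  while (in.read(chunk) != -1) {} // early reads succeed; that data was valid
} catch {
  case e: EOFException =>
    // The exception only says the stream ended early, not that any
    // particular record was bad -- hence the debate in this thread.
    println(s"decompression failed: ${e.getMessage}")
}
```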
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/15422
If this happens, it isn't clear that anything that was read is valid. It
doesn't seem like something to ignore. Log, at least. I know people differ on
this, but I think continuing with partial and
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/15422
Why would a corrupt record cause an EOFException to be thrown?
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15422
Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15422
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66698/
Test FAILed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15422
**[Test build #66698 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66698/consoleFull)**
for PR 15422 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15422
**[Test build #66698 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66698/consoleFull)**
for PR 15422 at commit
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/15422
retest this please
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15422
Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15422
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66691/
Test FAILed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15422
**[Test build #66691 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66691/consoleFull)**
for PR 15422 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15422
**[Test build #66691 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66691/consoleFull)**
for PR 15422 at commit
Github user zsxwing commented on the issue:
https://github.com/apache/spark/pull/15422
/cc @marmbrus