[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15422
  
@srowen Since this happens 'below' the user code (in HadoopRDD), is there 
a way for users to handle it themselves?
I agree that for many use cases where it is critical to work off the 
entire data, we should abort rather than process incomplete (and possibly 
corrupt) data; I am not sure whether we want to make this behavior 
configurable, though.

If I am not wrong, EOFException being thrown instead of IOException is an 
implementation detail of the codec (I don't think the contract specifies 
this). @zsxwing can confirm, though; I am not very familiar with that.
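
To make the "implementation detail" point concrete, here is a minimal sketch using Python's standard `gzip` module as a stand-in for a Hadoop codec (an assumption; nothing here touches Hadoop or Spark). The helper name `read_failure` is made up for illustration. A missing tail and a damaged checksum surface as different exception types, so a reader that special-cases only one of them handles only one kind of bad input.

```python
import gzip
import io


def read_failure(payload: bytes):
    """Fully decompress a gzip payload; return the exception raised, or None."""
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(payload)) as stream:
            stream.read()
    except Exception as exc:  # deliberately broad: we want to inspect the type
        return exc
    return None


good = gzip.compress(b"some records\n" * 1000)

# Case 1: the tail of the file is missing.
truncated = good[:-20]

# Case 2: same length, but the CRC32 stored in the 8-byte trailer is damaged.
damaged = bytearray(good)
damaged[-8] ^= 0xFF
bad_crc = bytes(damaged)

for name, payload in [("good", good), ("truncated", truncated), ("bad_crc", bad_crc)]:
    print(name, type(read_failure(payload)).__name__)
```

In Java, `EOFException` is a subclass of `IOException`, so the two cases are at least related by type there; in this Python stand-in they are unrelated types (`EOFError` versus an `OSError` subclass).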

Setting finished = true should probably be done not just for EOFException, 
but also for IOException.

I am fine with that as well, though this is a behavior change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15422
  
@zsxwing Isn't the map task run by MapTask? 
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MapTask.java

You can take a look at skipping bad records here: 
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Skipping+Bad+Records





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/15422
  
> For example, in MR you have the ability to even set the percentage of bad 
> records you want to tolerate (we don't have that in Spark).

I may be wrong, but in MR I think "bad records" just means that `map` or 
`reduce` throws an exception. It's not related to any IOException (including 
EOFException). 
(https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java#L1490)





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/15422
  
@mridulm for the scenario you're imagining, maybe the data is OK, sure. 
That doesn't mean it's true in all cases. Yeah, this is really to work around 
bad input, which you can to some degree do at the user level. Other parts of 
Spark don't work this way. I'm neutral on whether this is a good idea at all, 
but would prefer consistency more than anything.





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15422
  
@marmbrus +1 on logging; that is something which was probably missed here.





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15422
  
@zsxwing You are right, NewHadoopRDD is not handling this case.
It would probably be good to add exception handling there for when 
nextKeyValue throws an exception?

The context is that for large jobs/data, it is not unexpected to see some 
data corruption at times. We don't want to throw out the entire job due to a 
few bad records.
For example, in MR you have the ability to even set the percentage of bad 
records you want to tolerate (we don't have that in Spark).
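
As a rough illustration of tolerating a bounded fraction of bad records, here is a minimal Python sketch. It is not how MapReduce implements record skipping; the function `process_with_tolerance` and its `max_bad_fraction` parameter are hypothetical names chosen for this example.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

R = TypeVar("R")


def process_with_tolerance(
    records: Iterable[str],
    parse: Callable[[str], R],
    max_bad_fraction: float = 0.01,  # hypothetical knob, not a Spark/MR setting
) -> Tuple[List[R], int]:
    """Parse every record, skipping failures, and abort if too many failed."""
    good: List[R] = []
    bad = 0
    total = 0
    for record in records:
        total += 1
        try:
            good.append(parse(record))
        except ValueError:
            # A "bad record": count it and keep going.
            bad += 1
    if total and bad / total > max_bad_fraction:
        raise RuntimeError(
            f"{bad} of {total} records were bad, above the "
            f"{max_bad_fraction:.0%} tolerance"
        )
    return good, bad


parsed, skipped = process_with_tolerance(["1", "2", "x", "4"], int, max_bad_fraction=0.5)
print(parsed, skipped)
```

Checking the threshold at the end keeps the sketch short; a real job would more likely want to fail as soon as the tolerance can no longer be met.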





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/15422
  
@mridulm This fix just makes HadoopRDD consistent with NewHadoopRDD and the 
current behavior of Spark SQL in 2.0.

For 1.6, that's another story since Spark SQL uses HadoopRDD directly.





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread marmbrus
Github user marmbrus commented on the issue:

https://github.com/apache/spark/pull/15422
  
I agree that the data that was already read is probably good. I also think 
that this is a pretty big behavior change, and there are legitimate cases 
(e.g. tons of data where it is fine to miss some) where you'd only want a 
warning. Can we add a flag for failing on unexpected EOF? (Probably set to 
`true` in `master` and `false` in `branch-1.6`.)
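
A minimal sketch of what such a flag could look like, written in Python rather than against Spark's actual reader code. The flag name `fail_on_unexpected_eof` and the helpers `read_records` and `make_source` are hypothetical; the point is only the two behaviors: rethrow, or log a warning and treat the input as finished.

```python
import logging
from typing import Callable, Iterator, List, Optional

log = logging.getLogger("reader_sketch")


def read_records(
    next_record: Callable[[], Optional[int]],
    fail_on_unexpected_eof: bool = True,  # hypothetical flag name
) -> Iterator[int]:
    """Yield records until next_record returns None (clean end of input)."""
    while True:
        try:
            record = next_record()
        except EOFError as exc:
            if fail_on_unexpected_eof:
                raise
            # Lenient mode: warn, then behave as if the input had finished.
            log.warning("Unexpected EOF, stopping early: %s", exc)
            return
        if record is None:
            return
        yield record


def make_source(values: List[int]) -> Callable[[], Optional[int]]:
    """Build a fake source that serves the values, then fails like a truncated file."""
    remaining = list(values)

    def next_record() -> Optional[int]:
        if remaining:
            return remaining.pop(0)
        raise EOFError("truncated input")

    return next_record


lenient = list(read_records(make_source([1, 2, 3]), fail_on_unexpected_eof=False))
print(lenient)
```

With the default `fail_on_unexpected_eof=True`, the same source raises `EOFError` after the third record instead of ending quietly.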





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15422
  
@srowen The tuples already returned would have been valid; it is the 
subsequent block decompression which has failed. For example, in a 1 GB file, 
a few missing (or corrupt) bytes at the end will cause the last block to be 
decompressed incorrectly, but all previous tuples already returned would be 
fine and valid.

It is the 'current' key/value that resulted in the exception.
Note that the value returned from the method is not used when finished = 
true (what is returned is ignored).
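
The claim that earlier records stay valid can be checked on a small scale. The sketch below uses Python's standard `gzip` module on an in-memory payload as a stand-in for a compressed Hadoop input (an assumption, as is the helper name `read_lines_until_failure`): it truncates the compressed bytes and compares whatever was read before the failure against the original records.

```python
import gzip
import io
from typing import List, Optional, Tuple


def read_lines_until_failure(compressed: bytes) -> Tuple[List[str], Optional[Exception]]:
    """Read lines from a gzip payload, stopping at the first unexpected EOF."""
    lines: List[str] = []
    error: Optional[Exception] = None
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as stream:
            for raw in stream:
                lines.append(raw.decode("utf-8").rstrip("\n"))
    except EOFError as exc:
        # Roughly the point where the reader would set finished = true.
        error = exc
    return lines, error


original = [f"record-{i:05d}" for i in range(5000)]
payload = gzip.compress(("\n".join(original) + "\n").encode("utf-8"))

# Simulate a file whose tail is missing.
truncated = payload[: len(payload) // 2]
recovered, error = read_lines_until_failure(truncated)

print(len(recovered), "of", len(original), "records read before:", repr(error))
print("recovered records match the original prefix:",
      recovered == original[: len(recovered)])
```

This only covers truncation of a single gzip stream; it says nothing about corruption in the middle of a file, which is the case srowen is worried about.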





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/15422
  
If this happens, it isn't clear that anything that was read is valid. It 
doesn't seem like something to ignore. Log it, at least. I know people differ 
on this, but I think continuing with a partial and possibly corrupt read, even 
with a warning, seems more likely to cause tears.





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-11 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/15422
  
Why would a corrupt record cause EOFException to be thrown?





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15422
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15422
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66698/





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15422
  
**[Test build #66698 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66698/consoleFull)**
for PR 15422 at commit [`f810937`](https://github.com/apache/spark/commit/f8109371e6eab2ab4b30268aa7684bb685198297).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15422
  
**[Test build #66698 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66698/consoleFull)**
for PR 15422 at commit [`f810937`](https://github.com/apache/spark/commit/f8109371e6eab2ab4b30268aa7684bb685198297).





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/15422
  
retest this please





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15422
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15422
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66691/





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15422
  
**[Test build #66691 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66691/consoleFull)**
for PR 15422 at commit [`f810937`](https://github.com/apache/spark/commit/f8109371e6eab2ab4b30268aa7684bb685198297).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15422
  
**[Test build #66691 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66691/consoleFull)**
for PR 15422 at commit [`f810937`](https://github.com/apache/spark/commit/f8109371e6eab2ab4b30268aa7684bb685198297).





[GitHub] spark issue #15422: [SPARK-17850][Core]HadoopRDD should not catch EOFExcepti...

2016-10-10 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/15422
  
/cc @marmbrus 

