[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-15 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365495#comment-16365495 ]

Steve Loughran commented on SPARK-23308:


I'm going to recommend this is closed as a WONTFIX. It's not the place of Spark 
to determine what is recoverable; it'd never get it right.

> ignoreCorruptFiles should not ignore retryable IOException
> --
>
> Key: SPARK-23308
> URL: https://issues.apache.org/jira/browse/SPARK-23308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Márcio Furlani Carmona
>Priority: Minor
>
> When `spark.sql.files.ignoreCorruptFiles` is set, it totally ignores any kind 
> of RuntimeException or IOException, but some IOExceptions may happen even if 
> the file is not corrupted.
> One example is SocketTimeoutException, which can be retried and may then 
> fetch the data successfully; it does not mean the data is corrupted.
>  
> See: 
> https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163






[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-13 Thread Márcio Furlani Carmona (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363217#comment-16363217 ]

Márcio Furlani Carmona commented on SPARK-23308:


{quote}if your input stream is doing abort/reopen on seek & positioned read, 
then you get many more S3 requests when reading columnar data
{quote}
Good point! I'll see if I can get the actual S3 TPS we're reaching.

 

Regarding the SSE-KMS, we're not using it for encryption, so that shouldn't be 
a problem.




[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-12 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360809#comment-16360809 ]

Steve Loughran commented on SPARK-23308:


BTW

bq. I should get at least ~82k partitions, thus the same number of S3 requests.

If your input stream is doing abort/reopen on seek & positioned read, then you 
get many more S3 requests when reading columnar data, which bounces around a 
lot. See HADOOP-13203 for the work there, and the recent change of 
HADOOP-14965, which at least reduces the TCP abort call count but does nothing 
for the GET count, just using smaller ranges in the GET calls. Oh, and if you 
use SSE-KMS, there's separate throttling, but AFAIK it should only surface on 
the initial GET, when the encryption kicks off.
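
To illustrate the cost, here's a minimal, self-contained sketch of the 
close-and-reopen pattern described above; it is not the actual S3AInputStream 
code, just a counter showing why a seek-heavy columnar reader pays one GET per 
hop:

{code:scala}
// Sketch only: each seek away from the current position closes the stream
// and issues a fresh (simulated) ranged GET, as in the abort/reopen pattern.
import java.io.{ByteArrayInputStream, InputStream}

class ReopenOnSeekStream(data: Array[Byte]) {
  private var stream: InputStream = _
  private var pos: Long = -1L
  var getCount: Int = 0 // number of simulated GET requests issued

  // Stand-in for a ranged GET starting at `offset`.
  private def openAt(offset: Long): InputStream = {
    getCount += 1
    val s = new ByteArrayInputStream(data)
    s.skip(offset)
    s
  }

  def seek(target: Long): Unit =
    if (stream == null || target != pos) {
      if (stream != null) stream.close() // close/abort the old connection
      stream = openAt(target)            // fresh GET
      pos = target
    }

  def read(): Int = { val b = stream.read(); if (b >= 0) pos += 1; b }
}

object SeekCost extends App {
  val s = new ReopenOnSeekStream(Array.fill(1 << 20)(1: Byte))
  // Columnar-style access: footer first, then back to each column chunk.
  Seq(1048000L, 0L, 262144L, 524288L).foreach { off => s.seek(off); s.read() }
  println(s"GET requests: ${s.getCount}") // 4 hops -> 4 GETs
}
{code}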




[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-08 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357387#comment-16357387 ]

Steve Loughran commented on SPARK-23308:


HADOOP-15216 covers S3A handling this failure with backoff, as well as FNFEs 
caused by delete inconsistency.




[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-08 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357266#comment-16357266 ]

Steve Loughran commented on SPARK-23308:


bq. Another option would be creating a special exception 
(CorruptedFileException?) that could be thrown by FS implementations, letting 
them decide what is a corrupted file and what is just a transient error.

It's pretty hard to get consistent semantics on "working" FS behaviour, let 
alone failure modes; it's why the Hadoop FS specs and compliance tests have the 
notion of "strict" failure ("does what HDFS does") and "lax" ("raises an IOE"). 
AFAIK HDFS raises {{ChecksumException}} on checksum errors; I don't know what 
it does on, say, decryption failure or erasure coding problems, and don't 
really want to look. You could try to add a parent class here, "Unrecoverable 
IOE", and see about getting it into everything over time.

Common prefixes and the classic year=2018/month=12 partitioning are pretty 
pathological for S3. But like you say, 503 is the standard response, though it 
may be caught in the AWS SDK. Talk to the AWS people.




[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-07 Thread Márcio Furlani Carmona (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356271#comment-16356271 ]

Márcio Furlani Carmona commented on SPARK-23308:


That's true, Steve. I totally agree that it'd be hard to identify what is 
retryable from Spark's perspective, so the FS should be responsible for that 
decision.

I believe one option would be leaving the retry responsibility to the FS 
implementations (as it already seems to be) and adding documentation for this 
flag making clear that you might experience data loss. Another option would be 
creating a special exception (CorruptedFileException?) that could be thrown by 
FS implementations, letting them decide what is a corrupted file and what is 
just a transient error.
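
A rough sketch of the first option, retrying transient errors inside the 
custom FS stream; the wrapper, attempt count and backoff below are 
illustrative, not from any existing FileSystem:

{code:scala}
// Sketch only: `openAt` stands in for however the FS (re)opens the object at
// a given offset; the attempt count and backoff numbers are made up.
import java.io.InputStream
import java.net.SocketTimeoutException

class RetryingReader(openAt: Long => InputStream, attempts: Int = 3) {
  private var pos: Long = 0L
  private var in: InputStream = openAt(pos)

  def read(): Int = {
    var tries = 0
    while (true) {
      try {
        val b = in.read()
        if (b >= 0) pos += 1
        return b
      } catch {
        case _: SocketTimeoutException if tries < attempts =>
          tries += 1
          in.close()                  // drop the stale connection
          Thread.sleep(100L << tries) // simple exponential backoff
          in = openAt(pos)            // reopen at the failed offset
      }
    }
    -1 // unreachable; keeps the compiler happy
  }
}
{code}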

A more complex approach would be a file blacklist mechanism rather than this 
flag, similar to the `spark.blacklist.*` feature, where you decide how many 
times to retry a file before considering it corrupted; then you don't need to 
decide what is worth retrying. The downside is that you'll always retry, even 
when there's no point in retrying.

 

*Regarding the socket timeouts:* I also believe it's some kind of throttling.

I'm reading over 80k files with over 10TB of data. The file sizes are not 
uniform, so some files may be read in a single request while others get split 
into multiple partitions. Considering I'm not overriding the default 
`spark.sql.files.maxPartitionBytes` value of `128 MB`, I should get at least 
~82k partitions, thus the same number of S3 requests. Also, the files share 
some common prefixes, which [might be bad for S3 index 
access|https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html].
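
(For reference: 10 TB / 128 MB = (10 × 1024 × 1024) MB / 128 MB = 81,920 
splits, which is where the ~82k figure comes from.)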

I have a total of 120 executors, with 9 cores each, across 30 workers. But 
since each task involves a good amount of compute time, the median task 
execution time is 8s, so I don't think we're getting close to the 100 TPS for a 
common prefix mentioned in the S3 documentation above.

So I'm more inclined to say it might be some EC2 network throttling rather 
than S3 throttling. Another reason to believe that is that I've seen [503 Slow 
Down|https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html] 
errors from S3 in the past when my requests got throttled, and I'm not seeing 
them this time.




[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-06 Thread Steve Loughran (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354784#comment-16354784 ]

Steve Loughran commented on SPARK-23308:


bq. I have not heard this come up before as an issue in another implementation.

S3A's input stream handles an IOE other than EOF by incrementing metrics, 
closing the stream, and retrying once; generally that causes the error to be 
recovered from. If not, you are into the unrecoverable-network-problems kind of 
problem, except for the special case of "you are recycling the pool of HTTP 
connections and should abort that TCP connection before trying anything else". 
I think there are opportunities to improve S3A there by aborting the connection 
before retrying.
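
In sketch form, the recovery described above amounts to something like the 
following; this is a paraphrase of the behaviour, not the actual 
S3AInputStream source, and {{openAt}} stands in for the ranged-GET reopen:

{code:scala}
// Paraphrased sketch of "increment metrics, close, reopen, retry once".
import java.io.{EOFException, IOException, InputStream}

class RetryOnceStream(openAt: Long => InputStream) {
  private var pos: Long = 0L
  private var in: InputStream = openAt(pos)
  var readExceptions: Int = 0 // stand-in for the S3A metric

  def read(): Int =
    try {
      val b = in.read()
      if (b >= 0) pos += 1
      b
    } catch {
      case _: EOFException => -1 // EOF is expected, not an error
      case _: IOException =>
        readExceptions += 1      // increment metrics
        in.close()               // close (ideally abort) the connection
        in = openAt(pos)         // reopen at the same offset
        val b = in.read()        // retry once; a second failure propagates
        if (b >= 0) pos += 1
        b
    }
}
{code}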

I don't think Spark is in a position to be clever about retries, as it is too 
far from the low-level details of what is retryable vs. not; it would need a 
policy for all possible exceptions from all known FS clients, splitting them 
into "we can recover" vs. "no, fail fast".

Trying to come up with a good policy is (a) something the FS clients should be 
doing and (b) really hard to get right in the absence of frequent failures; it 
usually evolves based on bug reports. For example, 
[S3ARetryPolicy|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ARetryPolicy.java#L87]
 is very much a WiP (HADOOP-14531).

Marcio: I'm surprised you are getting so many socket timeouts. If this is 
happening in EC2, it's *potentially* throttling related; overloaded connection 
pools raise ConnectionPoolTimeoutException, apparently.




[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-06 Thread Márcio Furlani Carmona (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354250#comment-16354250 ]

Márcio Furlani Carmona commented on SPARK-23308:


Yeah, I set it back to `ignoreCorruptFiles=false` to prevent this. But then, 
if there is indeed a corrupt file, our job will never succeed until we fix it.

The biggest problem for me was the silent failure you mentioned. I only found 
out something was wrong after running a job on the same input multiple times 
and noticing some missing data; then I started investigating why and figured 
out it was due to this flag and the SocketTimeoutException I mentioned.

I agree the documentation should at least mention the risks of setting this 
flag and which exceptions cause the data to be considered corrupt. Right now I 
believe this flag is not even documented officially, is it? 
https://spark.apache.org/docs/latest/configuration.html







[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-05 Thread Imran Rashid (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353229#comment-16353229 ]

Imran Rashid commented on SPARK-23308:
--

Well, I think the complaint is that you end up losing data completely silently 
... but OTOH, you're sort of accepting that possibility when you say 
`ignoreCorruptFiles`.  So I am inclined to think the right answer is: set 
`ignoreCorruptFiles=false`.  Perhaps the doc can be improved a little.
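
For reference, that's the standard SQL conf, set here assuming the usual 
{{spark}} SparkSession is in scope:

{code:scala}
// Fail the task and surface the real exception instead of silently
// skipping the rest of a file on any IOException.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "false")

// or equivalently, per session, via SQL:
spark.sql("SET spark.sql.files.ignoreCorruptFiles=false")
{code}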




[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-05 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353152#comment-16353152 ]

Sean Owen commented on SPARK-23308:
---

I get that the difference is that a re-read might succeed, but I also wonder 
how common this is, and how much of a problem it is in other implementations 
to spend time retrying what may be a permanent failure.

I have not heard this come up before as an issue in another implementation.




[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-05 Thread Márcio Furlani Carmona (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353082#comment-16353082 ]

Márcio Furlani Carmona commented on SPARK-23308:


Yeah! The tricky thing is knowing which exceptions actually indicate corrupted 
data and which don't.

In our case we have a custom FileSystem implementation that accesses S3 using 
the AWS SDK. We noticed the issue because we randomly missed part of the data 
from some files. That was happening quite frequently (in about 50% of our 
runs), especially because of the large number of files we are reading from. 
And yes, we ran into this: it wasn't the whole rest of the content that was 
ignored, just a piece of it, a few thousand lines. This most likely happens 
because we're using partitioned files, so we lose one partition of the 
original file but not the whole file.

In our case we should be able to add retries in our custom FS implementation, 
but I believe this might affect other standard network-dependent FileSystems.

This is a stack trace example:

 
{code:java}
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:139)
at org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:200)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at com.amazonaws.services.s3.internal.S3AbortableInputStream.read(S3AbortableInputStream.java:125)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
at java.security.DigestInputStream.read(DigestInputStream.java:161)
at com.amazonaws.services.s3.internal.DigestValidationInputStream.read(DigestValidationInputStream.java:59)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at com.amazonaws.services.s3.internal.crypto.CipherLiteInputStream.nextChunk(CipherLiteInputStream.java:219)
at com.amazonaws.services.s3.internal.crypto.CipherLiteInputStream.read(CipherLiteInputStream.java:118)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
...
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:62)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:94)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:186)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1793)
at ...
{code}

[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-05 Thread Imran Rashid (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352744#comment-16352744 ]

Imran Rashid commented on SPARK-23308:
--

I think the problem is that it's really tricky to know which exceptions look 
like a corrupted file and which don't, especially as {{readFunction()}} is 
somewhat generic there.  Yeah, SocketTimeoutException is perhaps worth a retry ...

Did you run into this?  Do you have a full stack trace from when you got a 
SocketTimeoutException and the rest of the content was ignored?
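
For context, the handling at the linked FileScanRDD line boils down to a 
catch-all of roughly this shape; this is a paraphrase, not a verbatim quote of 
the Spark source, and {{readNext}}, {{markFinished}} and {{file}} stand in for 
the surrounding iterator state:

{code:scala}
// Paraphrased sketch: any RuntimeException or IOException from the reader,
// including a transient SocketTimeoutException, ends the file early, so the
// remaining rows are silently dropped.
import java.io.IOException

def nextOrSkip[T >: Null](readNext: () => T, markFinished: () => Unit,
                          file: String): T =
  try readNext() catch {
    case e @ (_: RuntimeException | _: IOException) =>
      println(s"Skipped the rest of the content in the corrupted file: $file ($e)")
      markFinished() // no distinction between corruption and transient errors
      null
  }
{code}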




[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-01 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349543#comment-16349543 ]

Sean Owen commented on SPARK-23308:
---

Sure, but what could you meaningfully do with SocketTimeoutException that is 
different?
