[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-16 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/18979
  
thanks for the review everyone!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18979
  
Thanks! Merged to master.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18979
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82745/



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18979
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18979
  
**[Test build #82745 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82745/testReport)**
 for PR 18979 at commit 
[`c0e81a1`](https://github.com/apache/spark/commit/c0e81a1c87011efdc010f1c9ba28dde003458667).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18979
  
**[Test build #82745 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82745/testReport)**
 for PR 18979 at commit 
[`c0e81a1`](https://github.com/apache/spark/commit/c0e81a1c87011efdc010f1c9ba28dde003458667).





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/18979
  
Done. Not writing 0-byte files will offer a significant speedup against 
object stores, where a call to getFileStatus() can take hundreds of 
milliseconds. I look forward to it.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/18979
  
Could you resolve the conflicts again?





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18979
  
**[Test build #82731 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82731/testReport)**
 for PR 18979 at commit 
[`649f8da`](https://github.com/apache/spark/commit/649f8da245443567b842b697a4d47e5241eb5946).
 * This patch passes all tests.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18979
  
Build finished. Test PASSed.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18979
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82731/



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18979
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82732/



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18979
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18979
  
**[Test build #82732 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82732/testReport)**
 for PR 18979 at commit 
[`d3f96f6`](https://github.com/apache/spark/commit/d3f96f63f263f134a79e76e8bdd1961f333b7c7a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/18979
  
Hi, @steveloughran .
> is the issue with ORC that if there's nothing to write, it doesn't 
generate a file (so avoiding that issue with sometimes you get 0-byte ORC files 
& things downstream fail)?

Yes. So far, Spark leaves an empty directory in the case of ORC.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18979
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18979
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82730/



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18979
  
**[Test build #82730 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82730/testReport)**
 for PR 18979 at commit 
[`adab985`](https://github.com/apache/spark/commit/adab985d0455f3f549ebceb03528f0a6a45a31c0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/18979
  
The latest PR update pulls in @dongjoon-hyun's new test; to avoid a merge 
conflict in the Insert suite, I've rebased against master.

1. Everything handles missing files on output.
2. There's only one logInfo at the end of the execute call, so if many 
empty files are created, the logs aren't too noisy.
3. There is now some implicit counting of how many files were missing 
(`submittedFiles - numFiles`); this isn't aggregated and reported, though.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18979
  
**[Test build #82732 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82732/testReport)**
 for PR 18979 at commit 
[`d3f96f6`](https://github.com/apache/spark/commit/d3f96f63f263f134a79e76e8bdd1961f333b7c7a).





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18979
  
**[Test build #82731 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82731/testReport)**
 for PR 18979 at commit 
[`649f8da`](https://github.com/apache/spark/commit/649f8da245443567b842b697a4d47e5241eb5946).





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18979
  
**[Test build #82730 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82730/testReport)**
 for PR 18979 at commit 
[`adab985`](https://github.com/apache/spark/commit/adab985d0455f3f549ebceb03528f0a6a45a31c0).





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/18979
  
Noted :)
@dongjoon-hyun: is the issue with ORC that if there's nothing to write, it 
doesn't generate a file (so avoiding the issue where you sometimes get 0-byte 
ORC files & things downstream fail)?

If so, the warning message which @gatorsmile has proposed could mislead 
people into worrying about a problem which isn't there, and the numFiles 
metric is going to mislead too.

I'm starting to worry about how noisy the log would be, both there and when 
working with S3 when it exhibits delayed visibility (rarer).

1. What if this patch just logged at debug: less noise, but still something 
there if people are trying to debug a mismatch?
1. If there's no file found, numFiles doesn't get incremented.
1. I count the number of files actually submitted.
1. And in `getFinalStats()`, log at info if there is a mismatch.

This would line things up in future for actually returning the list of 
expected vs actual files as a metric where it could be reported.
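
The four steps above could be sketched roughly as follows. This is a minimal illustration in Python rather than the Scala in the PR; the class name, method names and log wording are all hypothetical, and a local file probe stands in for the filesystem `getFileStatus()` call.

```python
import logging
import os
import tempfile

log = logging.getLogger("write_stats_sketch")


class WriteStatsSketch:
    """Toy stand-in for a write-task stats tracker (hypothetical names)."""

    def __init__(self):
        self.submitted_files = 0  # files the task reported as written
        self.num_files = 0        # files actually found when probed
        self.num_bytes = 0

    def file_closed(self, path):
        """Probe a finished file; a missing file is tolerated, not fatal."""
        self.submitted_files += 1
        try:
            size = os.path.getsize(path)
        except FileNotFoundError:
            # Either nothing was written (e.g. an empty output) or the
            # store has not made the object visible yet: log quietly.
            log.debug("File %s not found when collecting stats", path)
            return
        self.num_files += 1
        self.num_bytes += size

    def get_final_stats(self):
        """Return totals, logging once at info level on a mismatch."""
        missing = self.submitted_files - self.num_files
        if missing:
            log.info("Expected %d files, but only saw %d.",
                     self.submitted_files, self.num_files)
        return {"num_files": self.num_files,
                "num_bytes": self.num_bytes,
                "missing_files": missing}


def demo():
    """One real 3-byte file plus one path that was never created."""
    with tempfile.TemporaryDirectory() as d:
        present = os.path.join(d, "part-0000")
        with open(present, "wb") as f:
            f.write(b"abc")
        tracker = WriteStatsSketch()
        tracker.file_closed(present)
        tracker.file_closed(os.path.join(d, "part-0001"))
        return tracker.get_final_stats()
```

Keeping the per-file message at debug and the summary at info means a job that legitimately produces many empty outputs emits one line per task instead of one per file.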





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-13 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/18979
  
Gentle ping, @steveloughran !





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-11 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18979
  
Could you also include the [test 
cases](https://github.com/dongjoon-hyun/spark/blob/b545f281b19120cc2c9e4197cae4b1315969247d/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala#L2054-L2060)
 to 
[InsertSuite.scala](https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala)
 ?





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-11 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/18979
  
+1. This solves the regression on writing an empty dataset with the ORC 
format, too!





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-11 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18979
  
LGTM except a minor comment. 





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-11 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/18979
  
@viirya: the new data writer API will allow for a broader set of stats to 
be propagated back from workers. When you are working with object stores, 
useful stats to get back are the throttle count & retry count, as they can be 
the reason why things are slow ... and if it is due to throttling, throwing 
more workers at the job will actually slow things down. They'd be the ones to 
look at first.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-11 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18979
  
I don't have a strong opinion against this. An incorrect size is an issue, 
but I can't think of a better solution for now...





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-11 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18979
  
Will review it tomorrow





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-10 Thread adrian-ionescu
Github user adrian-ionescu commented on the issue:

https://github.com/apache/spark/pull/18979
  
To me, this looks good.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-10 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/18979
  
Has anyone had a look at this recently? 

The problem still exists, and while downstream filesystems can address it if 
they recognise the use case & lie about values, they will be returning invalid 
values to the caller: Spark will be reporting the wrong values. At least with 
this PR, Spark gets to make the decisions about how to react itself.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-08-22 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/18979
  
Related to this, updated spec on [Hadoop output stream, Syncable and 
StreamCapabilities](https://github.com/steveloughran/hadoop/blob/s3/HADOOP-13327-outputstream-trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/outputstream.md).

As the doc notes, object stores != filesystems, and while a lot can be done 
to preserve the metaphor on input, it's on output where CRUD inconsistencies 
surface, along with questions such as "does a 0-byte file get created in 
create()?", "when is data written?", etc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---




[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18979
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18979
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80841/



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18979
  
**[Test build #80841 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80841/testReport)**
 for PR 18979 at commit 
[`f778213`](https://github.com/apache/spark/commit/f778213d3adee1f2d6c977f92c093930f3d6c013).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18979
  
**[Test build #80841 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80841/testReport)**
 for PR 18979 at commit 
[`f778213`](https://github.com/apache/spark/commit/f778213d3adee1f2d6c977f92c093930f3d6c013).





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-08-18 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/18979
  
@adrian-ionescu wrote
> is there a need for calling getFinalStats() more than once?

No. As long as everyone is aware of it, it won't be an issue.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-08-18 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/18979
  
> To mimic S3-like behavior, you can overwrite the file system 
`spark.hadoop.fs.$scheme.impl`

@gatorsmile: you will be able to do something better soon, as S3A is adding 
an inconsistent AWS client into the `hadoop-aws` JAR, which you can then enable 
to guarantee consistency delays and inject intermittent faults into the system 
(throttling, transient network events). All it will take is a config option to 
switch to this client, plus the chaos-monkey-esque probabilities and delays. 
This is what I'm already using, and you will be able to as well. That is, no 
need to switch clients, just set 
`spark.hadoop.fs.s3a.s3.client.factory.impl=org.apache.hadoop.fs.s3a.InconsistentS3ClientFactory`
 and wait for the stack traces.

The S3A FS itself [needs to do 
more](https://issues.apache.org/jira/browse/HADOOP-14531) to handle throttling 
& failures (retry, add failure metrics so throttling & error rates can be 
measured).  Knowing throttling rates is important as it will help identify perf 
problems due to bad distribution of work across a bucket, excess use of KMS key 
lookup..., things that surface in support calls.

This patch restores in Spark 2.3 the behaviour of Spark 2.2: a brief 
delay between object creation and visibility does not cause the task to fail.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-08-18 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/18979
  
Currently *nobody should be using s3a:// as the temp file destination*, 
which is the same as saying "nobody should be using s3a:// as the direct 
destination of work", not without a special committer (Netflix, IBM's stocator, 
...) or without something to give S3 list consistency. That is because today, 
task commit relies on a list & rename of all files in the task attempt dir, and 
if you don't get list consistency, you can miss out on files. If you ever hear 
anyone complaining "it takes too long to commit to s3" then they are using it 
this way. Tell them to use a consistency layer or to stop it :)
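
To make the list & rename hazard concrete, here is a small self-contained simulation. It is only a sketch under stated assumptions: `LaggyStore`, its tick-based lag model and `commit_task` are invented for illustration and are not the S3, S3A or Hadoop committer APIs.

```python
class LaggyStore:
    """In-memory object store whose listings lag behind writes.

    Purely illustrative: the class and its lag model are assumptions.
    """

    def __init__(self, list_lag):
        self.list_lag = list_lag  # ticks before a new object is listed
        self.now = 0              # fake clock, advanced explicitly
        self.objects = {}         # path -> (data, creation tick)

    def put(self, path, data):
        self.objects[path] = (data, self.now)

    def tick(self, n=1):
        self.now += n

    def list(self, prefix):
        # Only objects at least `list_lag` ticks old appear in a listing.
        return sorted(
            p for p, (_, created) in self.objects.items()
            if p.startswith(prefix) and self.now - created >= self.list_lag
        )

    def rename(self, src, dst):
        # Object stores have no native rename; model it as copy + delete.
        data, _ = self.objects.pop(src)
        self.objects[dst] = (data, self.now)


def commit_task(store, attempt_dir, dest_dir):
    """Commit by listing the attempt dir and renaming whatever is listed."""
    committed = []
    for src in store.list(attempt_dir):
        dst = dest_dir + src[len(attempt_dir):]
        store.rename(src, dst)
        committed.append(dst)
    return committed


def demo():
    store = LaggyStore(list_lag=2)
    store.put("tmp/attempt_0/part-0000", b"a")
    store.tick(2)                               # old enough to be listed
    store.put("tmp/attempt_0/part-0001", b"b")  # written just before commit
    committed = commit_task(store, "tmp/attempt_0/", "out/")
    left_behind = [p for p in store.objects if p.startswith("tmp/")]
    return committed, left_behind
```

In `demo()`, the second file exists in the store but is not yet listed when the commit runs, so it is silently left in the attempt directory and never reaches the destination. The rename modelled as copy-and-delete also hints at why such commits are slow on an object store.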





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-08-18 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18979
  
Btw, as the file path passed to the stats tracker should be the task temp 
file, is it common to use S3 directly as the temp file output destination?





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-08-18 Thread adrian-ionescu
Github user adrian-ionescu commented on the issue:

https://github.com/apache/spark/pull/18979
  
Thanks for the fix and tests, @steveloughran!
Re 1. -- is there a need for calling `getFinalStats()` more than once? The 
function doc clearly states that it's not supported and may lead to undefined 
behaviour. Could be fixed, of course, but depending on the implementation of 
the stats tracker, that can be at the expense of additional memory or code 
complexity..





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18979
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-08-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18979
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80803/



[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18979
  
**[Test build #80803 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80803/testReport)**
 for PR 18979 at commit 
[`2a113fd`](https://github.com/apache/spark/commit/2a113fde1653743a3543df8ada395f320b826a3e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-08-17 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18979
  
To mimic S3-like behavior, you can overwrite the file system 
`spark.hadoop.fs.$scheme.impl`





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-08-17 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/18979
  
cc @adrian-ionescu 





[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-08-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18979
  
**[Test build #80803 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80803/testReport)**
 for PR 18979 at commit 
[`2a113fd`](https://github.com/apache/spark/commit/2a113fde1653743a3543df8ada395f320b826a3e).

