[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-04-24 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
Reynold, I am very aware of reviewers' time: I put 1+ hours a day into
reviewing on the Hadoop codebase, generally trying to review the work of
non-colleagues, so as to pull in the broad set of contributions which are
needed.

I have been trying to get some object-store-related patches into Spark
alongside the foundational work in fundamentally transforming how we work with
object storage, especially S3, in Hadoop. Without the Spark-side changes, a lot
gets lost: here, performance is approximately 100-300 ms/file when scanning an
object store.

Here I've split things in two, docs and diff. Both are independent, both
are reasonably tractable. If they can be reviewed fast and added, there are no
problems of patches ageing and everyone having to resync.

We can get this out of the way, and you'll have fewer reasons to be unhappy
with me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-04-24 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14731
  
Steve, I think the main point is you should also respect the time of
reviewers. The way most of your pull requests manifest has been suboptimal:
they often start with a very early WIP (which is not necessarily a problem),
and once in a while (e.g. a month or two) you update it to almost completely
change it. The time itself is a problem. It requires a lot of context switching
to review your pull requests. In addition, every time you update it, it looks
like a completely new giant pull request.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-04-24 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
OK. What is the way? Do I write a formal proposal?

Because right now there is no reliable way to get the full dependency graph
of Spark + Hadoop cloud JARs + direct cloud provider JARs (Azure, AWS) and their
dependencies (Jackson) in sync.

Which means that getting Spark to talk to object stores is more miss than
hit.

I'm happy to follow the proposal mechanism, including progress reports,
but I do at least need some kind of hope that my work will actually get in.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-04-08 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
@srowen anything else I need to do here?





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-03-29 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
Is there anything else I need to do here?





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74990/
Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #74990 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74990/testReport)**
 for PR 14731 at commit 
[`a3aaf26`](https://github.com/apache/spark/commit/a3aaf267d2ac30c012b4a71b7a80e28a49ff10be).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-03-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #74990 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74990/testReport)**
 for PR 14731 at commit 
[`a3aaf26`](https://github.com/apache/spark/commit/a3aaf267d2ac30c012b4a71b7a80e28a49ff10be).





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-03-20 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
Any more comments?





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-03-10 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
The Hadoop FS Spec has now been updated to declare exactly what HDFS does
w.r.t. timestamps, and to warn that what other filesystems and object stores do
is implementation- and installation-specific:
[filesystem.md](https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/introduction.md)

That is the documentation update associated with this one; some of the
content there was originally here, but moved over to the Hadoop docs for the
HDFS team to take the blame for when it changes.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-03-02 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/14731
  
@srowen Waiting for your final OK





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73434/
Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #73434 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73434/testReport)**
 for PR 14731 at commit 
[`724495b`](https://github.com/apache/spark/commit/724495b97c1521ae5bd4c284d911c5ae6f51b19c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-02-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73433/
Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #73433 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73433/testReport)**
 for PR 14731 at commit 
[`04f4967`](https://github.com/apache/spark/commit/04f49679b3f4f3e2d99e7cafeb9e4fa91fe98ece).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #73434 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73434/testReport)**
 for PR 14731 at commit 
[`724495b`](https://github.com/apache/spark/commit/724495b97c1521ae5bd4c284d911c5ae6f51b19c).





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-02-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #73433 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73433/testReport)**
 for PR 14731 at commit 
[`04f4967`](https://github.com/apache/spark/commit/04f49679b3f4f3e2d99e7cafeb9e4fa91fe98ece).





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-02-24 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
@uncleGen: reviewed this, tweaked the docs slightly but otherwise, there's 
nothing left to do that I can see





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71866/
Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #71866 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71866/testReport)**
 for PR 14731 at commit 
[`06b2bee`](https://github.com/apache/spark/commit/06b2beec75084db1ee330fa4ff4d50775d9f540c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-01-23 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
@uncleGen I've updated it. Note that
[HADOOP-13946](https://issues.apache.org/jira/browse/HADOOP-13946) tracks the
changes in the Hadoop docs, which write down what HDFS actually does, then
note how cloud object stores have no consistent behaviour w.r.t.
timestamps. While I personally believe that direct PUT calls are the way to
write data, there's still ambiguity as to when the objects get a timestamp (S3:
when the PUT/multipart put is first initiated, and not updated on the close()
if the put was started earlier), and so when they become visible. So: I don't go
into the details, just say "look at the docs, then test on your system". That's
about as authoritative as you can get.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #71866 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71866/testReport)**
 for PR 14731 at commit 
[`06b2bee`](https://github.com/apache/spark/commit/06b2beec75084db1ee330fa4ff4d50775d9f540c).





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-01-23 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
let me do a quick review & update





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-01-21 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/14731
  
@steveloughran Are you still working on this? 





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-01-04 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
Sean, I think I've managed to delete the lines where you were asking about 
globs

> Am I right that the net change here is not an optimization but an 
expansion of the behavior to support globs rather than single dirs?

There are no changes in this source to the expansion policy; that went
in with [SPARK-14976](https://issues.apache.org/jira/browse/SPARK-14976), "make
StreamingContext.textFileStream support wildcard". This updates the docs to
cover what goes on (the wildcard covering directories, but not the files inside
them), and makes the scan much more efficient on object stores. No changes in
the semantics of what gets found or when things get found.
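
A minimal sketch of the wildcard semantics described above, written in Python against a local filesystem rather than taken from this PR. The function name `find_new_files` and its parameters are hypothetical. It assumes the pattern selects directories only, that just the files directly inside each matched directory are considered, and that "new" means a modification time later than a given threshold.

```python
import glob
import os


def find_new_files(dir_pattern, newer_than):
    """Return files directly inside directories matching dir_pattern
    whose modification time is later than newer_than (epoch seconds).

    Hypothetical helper for illustration; not Spark's implementation.
    """
    new_files = []
    for directory in sorted(glob.glob(dir_pattern)):
        if not os.path.isdir(directory):
            continue  # the wildcard selects directories, not files
        with os.scandir(directory) as entries:
            for entry in entries:
                # entry.stat() is cached on the DirEntry, so each file's
                # metadata is fetched at most once per scan.
                if entry.is_file() and entry.stat().st_mtime > newer_than:
                    new_files.append(entry.path)
    return sorted(new_files)
```

Files in subdirectories of a matched directory are not picked up, which mirrors "the wildcard covering directories, but not the files inside them". Whether timestamps behave as expected on a given object store still has to be tested on that store.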





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-01-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-01-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70819/
Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-01-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #70819 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70819/testReport)**
 for PR 14731 at commit 
[`a9a6f7b`](https://github.com/apache/spark/commit/a9a6f7b9e3876e551a2568b6220559992db40228).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2017-01-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #70819 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70819/testReport)**
 for PR 14731 at commit 
[`a9a6f7b`](https://github.com/apache/spark/commit/a9a6f7b9e3876e551a2568b6220559992db40228).





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-10-14 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
@srowen have you got any comments on the last patch?





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66656/
Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #66656 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66656/consoleFull)**
 for PR 14731 at commit 
[`f8ed8a3`](https://github.com/apache/spark/commit/f8ed8a3551d1eed5db5a22f5eeb484614036fefe).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-10-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-10-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #66656 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66656/consoleFull)**
 for PR 14731 at commit 
[`f8ed8a3`](https://github.com/apache/spark/commit/f8ed8a3551d1eed5db5a22f5eeb484614036fefe).





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-09-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65592/
Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-09-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-09-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #65592 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65592/consoleFull)**
 for PR 14731 at commit 
[`57f697d`](https://github.com/apache/spark/commit/57f697dc718e536f512c856b8e6c8239e1133fd5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-09-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #65592 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65592/consoleFull)**
 for PR 14731 at commit 
[`57f697d`](https://github.com/apache/spark/commit/57f697dc718e536f512c856b8e6c8239e1133fd5).





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65498/
Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #65498 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65498/consoleFull)**
 for PR 14731 at commit 
[`735fc7c`](https://github.com/apache/spark/commit/735fc7c2343c08a323e3d213e611830e3b41ef04).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #65498 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65498/consoleFull)**
 for PR 14731 at commit 
[`735fc7c`](https://github.com/apache/spark/commit/735fc7c2343c08a323e3d213e611830e3b41ef04).





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-09-01 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
The latest patch pulls out the shortcutting of the `globStatus` call when 
there are no wildcard characters in the path; it is closer to the original patch.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64662/
Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #64662 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64662/consoleFull)**
 for PR 14731 at commit 
[`b60f175`](https://github.com/apache/spark/commit/b60f175b5ef058ed24b3ddaf9a85b899a5e33187).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #64662 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64662/consoleFull)**
 for PR 14731 at commit 
[`b60f175`](https://github.com/apache/spark/commit/b60f175b5ef058ed24b3ddaf9a85b899a5e33187).





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64534/
Test FAILed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #64534 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64534/consoleFull)**
 for PR 14731 at commit 
[`4134620`](https://github.com/apache/spark/commit/4134620210e28a2e182397a9bc94ccb8c4d5ffc4).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #64534 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64534/consoleFull)**
 for PR 14731 at commit 
[`4134620`](https://github.com/apache/spark/commit/4134620210e28a2e182397a9bc94ccb8c4d5ffc4).





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64488/
Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #64488 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64488/consoleFull)**
 for PR 14731 at commit 
[`fe40bd2`](https://github.com/apache/spark/commit/fe40bd2bf548ca973f9dcdf9426fb9834828f72b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #64488 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64488/consoleFull)**
 for PR 14731 at commit 
[`fe40bd2`](https://github.com/apache/spark/commit/fe40bd2bf548ca973f9dcdf9426fb9834828f72b).





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #64486 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64486/consoleFull)**
 for PR 14731 at commit 
[`9bc0ea9`](https://github.com/apache/spark/commit/9bc0ea9734ccaf11c6306a3496c98be8cc20faab).





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #64486 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64486/consoleFull)**
 for PR 14731 at commit 
[`9bc0ea9`](https://github.com/apache/spark/commit/9bc0ea9734ccaf11c6306a3496c98be8cc20faab).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64486/
Test FAILed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64368/
Test FAILed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #64368 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64368/consoleFull)**
 for PR 14731 at commit 
[`b63abfe`](https://github.com/apache/spark/commit/b63abfe32a5509f69c7f725a46b2e6ac8fb9cf1f).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-24 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
Having looked at the source code, `FileSystem.globStatus()` uses glob 
patterns, which are not the same as POSIX regular expressions. 
[org.apache.hadoop.fs.GlobPattern](http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-common/2.7.1/org/apache/hadoop/fs/GlobPattern.java#81)
 converts the glob to a regular expression.

For the docs, I'll just use a `*` wildcard in the example rather than try 
anything more sophisticated.
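
To make the glob/regex distinction concrete, here is a minimal sketch using Python's standard library rather than Hadoop's `GlobPattern`. The file names are made up for illustration, and Python's `fnmatch` syntax differs in detail from Hadoop's glob syntax; the point is only that the same pattern text means different things under the two interpretations.

```python
import fnmatch
import re

# As a glob: '*' means "any run of characters" and '.' is a literal dot.
pattern = "part-*.csv"

# fnmatch.translate converts the glob into an equivalent regular expression,
# which is the same kind of conversion a glob-to-regex compiler performs.
glob_as_regex = re.compile(fnmatch.translate(pattern))

# Compiling the same text directly as a regex changes its meaning:
# '-*' becomes "zero or more hyphens" and '.' matches any single character.
naive_regex = re.compile(pattern)

name = "part-0001.csv"
glob_matches = glob_as_regex.fullmatch(name) is not None
naive_matches = naive_regex.fullmatch(name) is not None

print(glob_matches, naive_matches)
```

Under these assumptions the converted glob accepts `part-0001.csv` while the naive regex rejects it, and the naive regex accepts an unrelated name such as `partXcsv` that the glob rejects.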





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #64368 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64368/consoleFull)**
 for PR 14731 at commit 
[`b63abfe`](https://github.com/apache/spark/commit/b63abfe32a5509f69c7f725a46b2e6ac8fb9cf1f).





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-23 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
The logic has become complex enough to merit unit tests. I'm pulling it into 
`SparkHadoopUtil` itself and writing tests for the possible cases: a simple path, 
a glob that matches one file, a glob that matches more than one, a glob that 
matches nothing, and file not found.
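
The five cases can be sketched against a local temporary directory. This is an illustration in Python, not the Spark code: `list_matching`, the wildcard set, and the file names are all hypothetical, and the local filesystem stands in for a Hadoop `FileSystem`.

```python
import glob
import os
import tempfile

# Hypothetical wildcard set; Hadoop's glob syntax has more special characters.
WILDCARDS = set("*?[")


def list_matching(pattern):
    """Return the sorted paths selected by `pattern`.

    A path with no wildcard is checked directly (raising if it is missing);
    anything else goes through glob expansion.
    """
    if not WILDCARDS.intersection(pattern):
        if not os.path.exists(pattern):
            raise FileNotFoundError(pattern)
        return [pattern]
    return sorted(glob.glob(pattern))


def run_cases():
    """Exercise the five cases and return the number of matches for each."""
    results = {}
    with tempfile.TemporaryDirectory() as root:
        for name in ("a.txt", "b.txt", "c.log"):
            open(os.path.join(root, name), "w").close()

        def under(relative):
            return os.path.join(root, relative)

        results["simple"] = len(list_matching(under("a.txt")))
        results["glob_one"] = len(list_matching(under("*.log")))
        results["glob_many"] = len(list_matching(under("*.txt")))
        results["glob_none"] = len(list_matching(under("*.csv")))
        try:
            list_matching(under("missing.txt"))
            results["not_found"] = False
        except FileNotFoundError:
            results["not_found"] = True
    return results


results = run_cases()
print(results)
```

Whether a missing non-glob path should raise or return an empty result is a design choice; the sketch raises so that "file not found" stays distinguishable from "glob matched nothing".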





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64296/
Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #64296 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64296/consoleFull)**
 for PR 14731 at commit 
[`79b57a2`](https://github.com/apache/spark/commit/79b57a2683dece86e1acd063b2d33fa5a6dd6038).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-23 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
1. Updated the code to bypass the glob routine when there is no wildcard; 
this bypasses something fairly inefficient. 
1. Reporting FNFE on that base dir differently: skip the stack trace 
(maybe: log at a lower level?). 
1. Updated the docs with a special list of blobstore best practices.

It's a bit hard to get the phrasing of what the wildcard does 
right; it needs careful review.

Tested using my s3 streaming test, which did use a * in the wildcard. All 
works, but there is no improvement in speed on what is a fairly unrealistic 
structure. The time to recursively list object stores remotely is tangibly 
slow. Maybe that should go in the text too: "it can take seconds to scan 
object stores for new data, with the time being proportional to directory 
depth and the number of files in a directory. Shallow and wide directory 
trees are faster"





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #64296 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64296/consoleFull)**
 for PR 14731 at commit 
[`79b57a2`](https://github.com/apache/spark/commit/79b57a2683dece86e1acd063b2d33fa5a6dd6038).





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-23 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
I've now done the [s3a streaming 
test/example](https://github.com/steveloughran/spark/blob/features/SPARK-7481-cloud/cloud/src/main/scala/org/apache/spark/cloud/s3/examples/S3Streaming.scala)

This uses a pattern of s3a/path/sub* as the directory path; it then creates a 
file in a directory, renames the dir to match the path, and verifies that the 
file was found in the time period allocated.

https://gist.github.com/steveloughran/c8b39a7b87a9bd63d7a383bda8687e7e


Notably, the scan of the empty dir took 150ms; once there are two entries 
under the tree, one dir and one file, the time jumps up to 500ms.

Summary stats show 72 getFileStatus calls at the FS API, mapping to 140 
HEAD calls and 88 LIST operations. 

```
 S3AFileSystem{uri=s3a://stevel-ireland-new, 
workingDir=s3a://hwdev-steve-ireland-new/user/stevel, inputPolicy=sequential, 
partSize=104857600, enableMultiObjectsDelete=true, maxKeys=5000, 
readAhead=65536, blockSize=1048576, multiPartThreshold=2147483647, statistics 
{292 bytes read, 292 bytes written, 101 read ops, 0 large read ops, 11 write 
ops}, metrics {{Context=S3AFileSystem} 
{FileSystemId=343b706a-c238-4d71-9ed8-8083601ac28a-hwdev-steve-ireland-new} 
{fsURI=s3a://hwdev-steve-ireland-new} {files_created=1} {files_copied=1} 
{files_copied_bytes=292} {files_deleted=1} {directories_created=3} 
{directories_deleted=0} {ignored_errors=2} {op_copy_from_local_file=0} 
{op_exists=1} {op_get_file_status=72} {op_glob_status=16} {op_is_directory=0} 
{op_is_file=0} {op_list_files=0} {op_list_located_status=0} {op_list_status=27} 
{op_mkdirs=2} {op_rename=1} {object_copy_requests=0} {object_delete_requests=3} 
{object_list_requests=88} {object_continue_list_requests=0} 
{object_metadata_requests=140} {object_multipart_aborted=0} {object_put_bytes=292} 
{object_put_requests=4} {stream_read_fully_operations=0} 
{stream_bytes_skipped_on_seek=0} {stream_bytes_backwards_on_seek=0} 
{stream_bytes_read=292} {streamOpened=1} {stream_backward_seek_pperations=0} 
{stream_read_operations_incomplete=0} {stream_bytes_discarded_in_abort=0} 
{stream_close_operations=1} {stream_read_operations=1} {stream_aborted=0} 
{stream_forward_seek_operations=0} {streamClosed=1} {stream_seek_operations=0} 
{stream_bytes_read_in_close=0} {stream_read_exceptions=0} }}
```

I'm going to do a test run with the modification here and see what it does 
to listing and status
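
For comparing runs before and after the patch, the counters in a dump like the one above can be pulled out mechanically. This is a minimal sketch, assuming the `{name=value}` layout shown in the dump; `parse_metrics` is a hypothetical helper written for illustration, not part of Hadoop or Spark, and the string is an abbreviated copy of the counters quoted in this comment.

```python
import re

# Abbreviated copy of the counters quoted in the comment above.
dump = (
    "{op_get_file_status=72} {op_glob_status=16} {op_list_status=27} "
    "{object_list_requests=88} {object_metadata_requests=140}"
)


def parse_metrics(text):
    """Return a dict of the integer counters written as {name=value}."""
    # Non-numeric entries such as {Context=S3AFileSystem} do not match
    # the \d+ part of the pattern and are skipped.
    pairs = re.findall(r"\{(\w+)=(\d+)\}", text)
    return {name: int(value) for name, value in pairs}


metrics = parse_metrics(dump)

# LIST plus HEAD requests actually sent to the object store.
store_requests = (
    metrics["object_list_requests"] + metrics["object_metadata_requests"]
)
print(metrics["op_get_file_status"], store_requests)
```

Running the same parse on a dump taken after the change gives two dicts whose counters can be diffed directly.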





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-23 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
Actually, I've just noticed that DStream behaviour isn't in sync with the 
streaming programming guide, which says "(files written in nested directories 
not supported)". That is: SPARK-14796 didn't patch the docs.

It may as well be fixed in this patch. How about, in the bullet points 
underneath:

- Wildcards may be used to specify a set of directories to scan for new 
files, for example `hdfs://nn1:8050/users/alice/logs/2016-*/*.gz`
- New directories and their contents will be discovered as they arrive

Special points for object stores
- Wildcard lookup may be very slow with some object stores.
- Directory rename is not atomic; if a directory is renamed into the 
streaming source, then the files within may only be discovered and processed 
across multiple streaming windows.

Plus, there's another optimisation: use the {{SparkHadoopUtil.isGlobPath()}} 
predicate to recognise when the dir path isn't a wildcard, in which case just 
do a simple {{listFiles()}}. Until that shortcutting is done automatically in 
the Hadoop FS implementation, Spark can do it on its side. As the 
{{listFiles()}} call was what was used before SPARK-14796, it has to be 
compatible, else SPARK-14796 has broken things
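
The `isGlobPath()` shortcut described above can be illustrated on a local filesystem. This is a rough sketch only, not the Spark code: `is_glob_path` and `list_candidates` are hypothetical names, the set of wildcard characters is an assumption, and Python's `glob` module stands in for the Hadoop globber (it does not expand `{a,b}` alternatives).

```python
import glob
import os
import tempfile

# Assumption: characters that mark a path as a wildcard pattern.
GLOB_CHARS = set("{}[]*?\\")


def is_glob_path(pattern):
    """True if the path contains any wildcard character."""
    return any(ch in GLOB_CHARS for ch in pattern)


def list_candidates(pattern):
    """List entries under `pattern`, globbing only when it is needed."""
    if is_glob_path(pattern):
        # Wildcard: expand to the matching directories first.
        dirs = [d for d in glob.glob(pattern) if os.path.isdir(d)]
    else:
        # No wildcard: skip the glob machinery, list the directory directly.
        dirs = [pattern]
    found = []
    for d in dirs:
        found.extend(os.path.join(d, name) for name in os.listdir(d))
    return sorted(found)


# Build a small tree: two dated directories with one file each.
root = tempfile.mkdtemp()
for sub in ("logs-2016-01", "logs-2016-02"):
    os.makedirs(os.path.join(root, sub))
    open(os.path.join(root, sub, "part-0.gz"), "w").close()

plain = list_candidates(os.path.join(root, "logs-2016-01"))
wild = list_candidates(os.path.join(root, "logs-2016-*"))
print(len(plain), len(wild))
```

In the non-wildcard branch a missing directory surfaces as `FileNotFoundError` from the listing call itself, which is where a caller would decide how loudly to log it.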






[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-23 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
LGTM. I was trying to see if there was a way to create a good test here by 
triggering the takes-too-long codepath and having a counter, but there's no 
obvious way to do that deterministically. I am doing a test for this against s3 
in the spark-cloud module I'm writing; I can look at the printed counts of 
getFileStatus before/after the patch to see the difference, but the actual 
(testable) metrics are only accessible with the forthcoming Hadoop 2.8 release.

TL;DR: no easy test, so there's nothing left to do





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-23 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14731
  
This is ready to go, right @steveloughran? LGTM





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #64156 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64156/consoleFull)**
 for PR 14731 at commit 
[`b08e3c9`](https://github.com/apache/spark/commit/b08e3c9937a63a08b274a1491ea7064168646f1d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64156/
Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #64156 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64156/consoleFull)**
 for PR 14731 at commit 
[`b08e3c9`](https://github.com/apache/spark/commit/b08e3c9937a63a08b274a1491ea7064168646f1d).





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64142/
Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #64142 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64142/consoleFull)**
 for PR 14731 at commit 
[`6e8ace0`](https://github.com/apache/spark/commit/6e8ace0444ec9bdebc7c809a08628891f6de5fd0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-20 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14731
  
Ah right, you already have the modification time for free. Sounds good, 
remove the caching.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #64142 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64142/consoleFull)**
 for PR 14731 at commit 
[`6e8ace0`](https://github.com/apache/spark/commit/6e8ace0444ec9bdebc7c809a08628891f6de5fd0).





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-20 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
To be precise: the caching of file modification times is superfluous. It's 
there to avoid the cost of executing `getFileStatus()` on previously scanned 
files. Once you use the FileStatus returned in a listing, you aren't calling 
`getFileStatus()`, hence no need to cache.
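
The argument can be checked with a toy model. The sketch below is an assumption-laden stand-in, not the Hadoop API: `CountingStore`, `Status` and both helper functions are hypothetical, and the store simply counts how often each call is made when filtering by path versus filtering on the status objects the listing already returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Status:
    path: str
    modification_time: int


class CountingStore:
    """Hypothetical stand-in for a remote filesystem client; counts calls."""

    def __init__(self, files):
        self._files = dict(files)
        self.list_calls = 0
        self.status_calls = 0

    def list_status(self, accept=lambda status: True):
        # One listing call returns path and modification time together.
        self.list_calls += 1
        statuses = [Status(p, t) for p, t in sorted(self._files.items())]
        return [s for s in statuses if accept(s)]

    def get_file_status(self, path):
        # A separate round trip per file.
        self.status_calls += 1
        return Status(path, self._files[path])


def new_files_by_path(store, threshold):
    """Path-only filtering: every file needs its own status lookup."""
    paths = [s.path for s in store.list_status()]
    return [
        p for p in paths
        if store.get_file_status(p).modification_time > threshold
    ]


def new_files_by_status(store, threshold):
    """Filter on the statuses the listing already returned."""
    recent = store.list_status(lambda s: s.modification_time > threshold)
    return [s.path for s in recent]


files = {"a.log": 100, "b.log": 200, "c.log": 300}
slow, fast = CountingStore(files), CountingStore(files)
by_path = new_files_by_path(slow, 150)
by_status = new_files_by_status(fast, 150)
print(slow.status_calls, fast.status_calls)
```

Both routes select the same files; only the path-based one issues per-file status calls, which is the cost a modification-time cache would otherwise be hiding.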





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-20 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14731
  
Why is the caching superfluous -- because no file is evaluated more than 
once here?





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-20 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/14731
  
# I'm going to scan through and tune them elsewhere; really I'm going by 
uses of the listFiles calls

There's actually no significant use elsewhere that I can see; just a couple 
of uses which filter on filename, so there is no cost penalty.

* `SparkHadoopUtil.listLeafStatuses()` does implement its own directory 
recursion to find files; `FileSystem.listFiles(path, true)` does that, and on 
S3A will do a flat scan that is O(files/5000), with no directory overhead at 
all.
* Otherwise, `globStatus()` can be pretty slow against object stores, but the 
fix there isn't in the client code; it means someone needs to implement 
[HADOOP-13371](https://issues.apache.org/jira/browse/HADOOP-13371), *S3A 
globber to use bulk listObject call over recursive directory scan*, or more 
specifically, an implementation scalable to production datasets. 

Returning to this patch, should I cut out the caching? I think it is 
superfluous. 
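
To make the O(files/5000) figure concrete, here is a simplified request-count model. It is a sketch under stated assumptions: it counts only LIST requests, ignores HEAD calls, takes the page size of 5000 from the figure above, and both function names are hypothetical.

```python
import math

PAGE_SIZE = 5000  # assumption: keys returned per LIST request


def flat_list_requests(keys, page_size=PAGE_SIZE):
    """LIST requests for one flat, paged scan over every key."""
    return max(1, math.ceil(len(keys) / page_size))


def treewalk_list_requests(keys, page_size=PAGE_SIZE):
    """LIST requests when each directory is listed separately."""
    children = {"": set()}  # directory -> its direct children
    for key in keys:
        parts = key.split("/")
        parent = ""
        for i, part in enumerate(parts):
            path = part if not parent else parent + "/" + part
            children.setdefault(parent, set()).add(path)
            if i < len(parts) - 1:
                # Intermediate component: a directory to be listed later.
                children.setdefault(path, set())
            parent = path
    # At least one request per directory, more if it spans several pages.
    return sum(
        max(1, math.ceil(len(entries) / page_size))
        for entries in children.values()
    )


# 100 directories with 10 files each.
keys = [f"d{i}/f{j}" for i in range(100) for j in range(10)]
flat = flat_list_requests(keys)
walk = treewalk_list_requests(keys)
print(flat, walk)
```

In this model the flat scan cost depends only on the number of keys, while the tree walk pays at least one request per directory, so many small directories are the expensive case for the recursive route.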






[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-20 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14731
  
LGTM. Does this sort of change make sense elsewhere where `PathFilter` is 
used? I glanced at the others and it looked like a wash in other cases.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64140/
Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #64140 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64140/consoleFull)**
 for PR 14731 at commit 
[`738c51b`](https://github.com/apache/spark/commit/738c51bb57f331c58a877aa20aa5e2beb1084114).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #64140 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64140/consoleFull)**
 for PR 14731 at commit 
[`738c51b`](https://github.com/apache/spark/commit/738c51bb57f331c58a877aa20aa5e2beb1084114).

