[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-17 Thread pwoody
Github user pwoody commented on the issue:

https://github.com/apache/spark/pull/15835
  
Hey - yeah, definitely a real concern, since driver heap usage has to scale with the size of the metadata of the table you are going to read in.

We could get creative here: add heuristics that skip reading the metadata entirely when it is too large, or add the ability to spill the cached metadata to disk. Do you have any preferences/thoughts?
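
For illustration, here is a minimal sketch of what a bounded, expiring metadata cache might look like, using Guava's `CacheBuilder` (already on Spark's classpath). The `FooterMetadata` type and `MetadataCache` object are assumptions made up for this sketch, not the PR's actual code, and spilling to disk would need machinery beyond it:

```
import java.util.concurrent.{Callable, TimeUnit}

import com.google.common.cache.{Cache, CacheBuilder}

// Hypothetical stand-in for whatever footer/summary metadata gets cached.
case class FooterMetadata(path: String, serializedFooter: Array[Byte])

object MetadataCache {
  // Bound the number of entries and expire idle ones, so driver heap
  // no longer scales unboundedly with the size of the table's metadata.
  private val cache: Cache[String, FooterMetadata] = CacheBuilder.newBuilder()
    .maximumSize(10000)
    .expireAfterAccess(10, TimeUnit.MINUTES)
    .build[String, FooterMetadata]()

  def getOrLoad(path: String)(load: => FooterMetadata): FooterMetadata =
    cache.get(path, new Callable[FooterMetadata] {
      override def call(): FooterMetadata = load
    })
}
```

If footer sizes vary a lot, Guava's `maximumWeight`/`weigher` pair could bound the cache by approximate bytes instead of entry count.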



[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-17 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15835
  
This creates huge problems when the table is big, doesn't it? We just made a big change to get rid of the per-table file status cache, because its existence made Spark unstable when dealing with tables that have a lot of files.


[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-17 Thread pwoody
Github user pwoody commented on the issue:

https://github.com/apache/spark/pull/15835
  
I've updated the structure of the PR so that the cache is global across instances of FileFormat, has expiry, and reuses known filters. Here is a new benchmark highlighting the filter re-use (I flipped the results to make them easier to read):

```
withSQLConf(ParquetOutputFormat.ENABLE_JOB_SUMMARY -> "true") {
  withTempPath { path =>
    spark.range(0, 200, 1, 200)
      .write.parquet(path.getCanonicalPath)
    val benchmark = new Benchmark("Parquet partition pruning benchmark", 200)
    benchmark.addCase("Parquet partition pruning disabled") { iter =>
      withSQLConf(SQLConf.PARQUET_PARTITION_PRUNING_ENABLED.key -> "false") {
        var df = spark.read.parquet(path.getCanonicalPath).filter("id = 0")
        for (i <- 1 to 10) {
          df = df.filter(s"id < $i")
          df.collect()
        }
      }
    }
    benchmark.addCase("Parquet partition pruning enabled") { iter =>
      var df = spark.read.parquet(path.getCanonicalPath).filter("id = 0")
      for (i <- 1 to 10) {
        df = df.filter(s"id < $i")
        df.collect()
      }
    }
    benchmark.run()
  }
}
```


```
Running benchmark: Parquet partition pruning benchmark
  Running case: Parquet partition pruning disabled
  Stopped after 2 iterations, 8744 ms
  Running case: Parquet partition pruning enabled
  Stopped after 5 iterations, 2187 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26 on Mac OS X 10.10.5
Intel(R) Core(TM) i7-3635QM CPU @ 2.40GHz

Parquet partition pruning benchmark: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------
Parquet partition pruning disabled         4332 / 4372          0.0    21659450.2       1.0X
Parquet partition pruning enabled           399 /  438          0.0     1995877.1      10.9X
```


[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15835
  
I guess we should ping @liancheng as he was reviewing the previous one.


[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-14 Thread pwoody
Github user pwoody commented on the issue:

https://github.com/apache/spark/pull/15835
  
@rxin @HyukjinKwon ready for more review on my end.


[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-11 Thread pwoody
Github user pwoody commented on the issue:

https://github.com/apache/spark/pull/15835
  
I've pushed up the ability to configure whether this feature is enabled as well. Here is a benchmark that writes out 200 files with this code:

```
// numPartitions corresponds to the 200 files mentioned above
val numPartitions = 200
withSQLConf(ParquetOutputFormat.ENABLE_JOB_SUMMARY -> "true") {
  withTempPath { path =>
    spark.range(0, numPartitions, 1, numPartitions)
      .write.parquet(path.getCanonicalPath)
    val benchmark = new Benchmark("Parquet partition pruning benchmark", numPartitions)

    benchmark.addCase("Parquet partition pruning enabled") { iter =>
      spark.read.parquet(path.getCanonicalPath).filter("id = 0").collect()
    }

    benchmark.addCase("Parquet partition pruning disabled") { iter =>
      withSQLConf(SQLConf.PARQUET_PARTITION_PRUNING_ENABLED.key -> "false") {
        spark.read.parquet(path.getCanonicalPath).filter("id = 0").collect()
      }
    }

    benchmark.run()
  }
}
```


```
Running benchmark: Parquet partition pruning benchmark
  Running case: Parquet partition pruning enabled
  Stopped after 12 iterations, 2049 ms
  Running case: Parquet partition pruning disabled
  Stopped after 5 iterations, 2177 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26 on Mac OS X 10.10.5
Intel(R) Core(TM) i7-3635QM CPU @ 2.40GHz

Parquet partition pruning benchmark: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------
Parquet partition pruning enabled           145 /  171          0.0      723119.3       1.0X
Parquet partition pruning disabled          414 /  436          0.0     2070279.4       0.3X
```


[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-11 Thread pwoody
Github user pwoody commented on the issue:

https://github.com/apache/spark/pull/15835
  
Cool - I've added the caching, fixed style issues, and added pruning to the 
bucketed reads.


[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-10 Thread pwoody
Github user pwoody commented on the issue:

https://github.com/apache/spark/pull/15835
  
Ah, awesome, thanks - my IDE settings must not be catching the style properly; I'll fix that and make sure it doesn't happen in the future.

Also just took a look at the bucketing. I've implemented the same logic, which prunes a file if no matching splits exist, but it is still all-or-nothing for a given file because tasks are launched per bucket. Commit coming soon.
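
To make the all-or-nothing behavior concrete, here is a minimal hypothetical sketch (not the PR's actual code): `survivesFilter` stands in for the metadata-based split filter, and a bucketed file is kept whole whenever any of its splits survives, since tasks are launched per bucket rather than per split.

```
import org.apache.hadoop.fs.FileStatus

// Hypothetical sketch: keep a bucketed file if any split survives the
// metadata filter; otherwise drop the whole file.
def pruneBucketedFiles(
    files: Seq[FileStatus],
    splitSize: Long,
    survivesFilter: (FileStatus, Long, Long) => Boolean // (file, start, length)
  ): Seq[FileStatus] = {
  files.filter { file =>
    val offsets = 0L until file.getLen by splitSize
    offsets.exists { start =>
      survivesFilter(file, start, math.min(splitSize, file.getLen - start))
    }
  }
}
```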

Thanks again.


[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15835
  
Ah, I missed the last comment on the old PR. Okay, we can shape this up nicer. BTW, Spark packs small partitions into each task, so I guess this would not always introduce a lot of tasks, but yes, reducing the number of tasks is still a valid point.

Right, I am fine with this. I had thought the original PR was taken over without the courtesy of a notification or prior discussion. For the benchmark, I have an example in PR 14660, which I have also referenced from other PRs.

I have just a few minor notes. Maybe `Closes #14649` can be added at the end of the PR description so that the committers' merge script can close the original one if this one gets merged. Also, there are some style guidelines I usually refer to: the [wiki](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and [databricks/scala-style-guide](https://github.com/databricks/scala-style-guide). I will leave some comments on the changed files.


[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-10 Thread pwoody
Github user pwoody commented on the issue:

https://github.com/apache/spark/pull/15835
  
Hey @HyukjinKwon - appreciate the feedback!

Re: file touching - if I add the cache for the `_metadata` file, then this PR will end up touching at most one file per rootPath driver-side (generally just one total).

Re: files vs. splits - the main difference when pruning splits instead of files is that, with larger files, you end up spawning tasks that are immediately filtered out executor-side after grabbing the footer. For example, if maxSplitBytes equals the Parquet row group size, then a single hit in a file will spawn a task for every row group even if the file has only one matching block. This overhead can get expensive in a setup with dynamicAllocation and multi-tenancy. The general goal is to reduce the total number of tasks.
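
As a rough illustration of that arithmetic (the numbers are hypothetical, chosen only to make the point):

```
// Hypothetical example: a 1 GB Parquet file with 8 row groups of 128 MB,
// maxSplitBytes = 128 MB, and a filter that matches only one row group.
val rowGroupSize = 128L * 1024 * 1024
val fileSize = 8 * rowGroupSize
val maxSplitBytes = rowGroupSize

// File-level pruning keeps the whole file, so every split becomes a task;
// 7 of the 8 tasks are filtered out executor-side after reading the footer.
val tasksWithFilePruning = math.ceil(fileSize.toDouble / maxSplitBytes).toInt // 8

// Split-level pruning keeps only the split covering the matching row group.
val tasksWithSplitPruning = 1
```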

Re: bucketed - yep sorry, will fix!

Re: benchmarks - yeah, totally happy to poke around and make some benchmarks. ParquetReadBenchmark is the appropriate place, I suppose?

Re: old PR - the code there is out of date, and I don't believe Andrew is actively working on it based on his last comment. This is the follow-up to the original PR, done in what I believe is a more comprehensive and reliable way.

Thanks!


[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15835
  
I think we should cc @liancheng as well, who has good insight into this area.


[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15835
  
Hi @pwoody. So, if I understand this correctly, the original PR only filters out the files to touch ahead of time, but this one also proposes filtering splits via offsets from Parquet's metadata driver-side, right?

IIRC, each task already reads the footer before it actually starts reading in the executors, and then drops the splits on the Parquet side. This works fine in both Spark's vectorized Parquet reader and the normal Parquet reader. It might be worth reducing the files to touch, for the reasons I and others described in the original PR, but I am not sure about pruning splits ahead of time.
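
For reference, a minimal sketch of the existing executor-side mechanism described above, using Parquet's public filter API (the column name `id` is just an example): Spark pushes the predicate into the Hadoop configuration, and the Parquet reader then uses row-group statistics from the footer to skip non-matching row groups.

```
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.filter2.predicate.FilterApi
import org.apache.parquet.hadoop.ParquetInputFormat

val conf = new Configuration()
// Predicate equivalent to "id = 0"; row groups whose min/max statistics
// rule out a match are dropped by the reader without being scanned.
val predicate = FilterApi.eq(FilterApi.longColumn("id"), java.lang.Long.valueOf(0L))
ParquetInputFormat.setFilterPredicate(conf, predicate)
```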

Another potential problem I see here is that it does not seem to consider bucketed table reads, whereas the original PR does.

In addition, I guess we really need a benchmark for a proposal that aims to improve performance. It is fine if I am wrong and this PR already has a benchmark showing the performance improvement with a reasonable explanation.

Lastly, I guess this is a follow-up that includes the changes proposed in 14649. Maybe we can wait until that one is merged before submitting a follow-up. I guess it is still being reviewed, and I think @andreweduffy is still responding fine.


[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-09 Thread pwoody
Github user pwoody commented on the issue:

https://github.com/apache/spark/pull/15835
  
There is a performance issue here in that we re-fetch the metadata file for each data file. My understanding is that FileFormat is meant to be stateless, but if we can add in a cache, that would be very useful.


[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15835
  
Can one of the admins verify this patch?


[GitHub] spark issue #15835: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-09 Thread pwoody
Github user pwoody commented on the issue:

https://github.com/apache/spark/pull/15835
  
@andreweduffy @robert3005 @HyukjinKwon 

