[GitHub] spark pull request #15835: [SPARK-17059][SQL] Allow FileFormat to specify pa...

2016-11-10 Thread andreweduffy
Github user andreweduffy commented on a diff in the pull request:

https://github.com/apache/spark/pull/15835#discussion_r87503167
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -703,6 +703,34 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       }
     }
   }
+
+  test("SPARK-17059: Allow FileFormat to specify partition pruning strategy") {
+    withSQLConf(ParquetOutputFormat.ENABLE_JOB_SUMMARY -> "true") {
+      withTempPath { path =>
+        spark.sparkContext.parallelize(Seq(1, 2, 3), 3)
+          .toDF("x").write.parquet(path.getCanonicalPath)
+
+        val zeroPartitions = spark.read.parquet(path.getCanonicalPath).where("x = 0")
+        assert(zeroPartitions.rdd.partitions.length == 0)
+
+        val onePartition = spark.read.parquet(path.getCanonicalPath).where("x = 1")
+        assert(onePartition.rdd.partitions.length == 1)
+      }
+    }
+  }
+
+  test("Do not filter out parquet file when missing in _metadata file") {
--- End diff --

I'm not sure I understand what this test accomplishes. If you write out twice with summary metadata, why would anything be missing in the `_metadata` file?





[GitHub] spark pull request #15835: [SPARK-17059][SQL] Allow FileFormat to specify pa...

2016-11-10 Thread andreweduffy
Github user andreweduffy commented on a diff in the pull request:

https://github.com/apache/spark/pull/15835#discussion_r87500313
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -272,6 +275,83 @@ class ParquetFileFormat
         true
     }
 
+  override def getSplits(fileIndex: FileIndex,
+                         fileStatus: FileStatus,
+                         filters: Seq[Filter],
+                         schema: StructType,
+                         hadoopConf: Configuration): Seq[FileSplit] = {
+    if (filters.isEmpty) {
+      // Return immediately to save FileSystem overhead
+      super.getSplits(fileIndex, fileStatus, filters, schema, hadoopConf)
+    } else {
+      val filePath = fileStatus.getPath
+      val rootOption: Option[Path] = fileIndex.rootPaths
+        .find(root => filePath.toString.startsWith(root.toString))
+      val metadataOption = rootOption.flatMap(getMetadataForPath(filePath, _, hadoopConf))
+      // If the metadata exists, filter the splits.
+      // Otherwise, fall back to the default implementation.
+      metadataOption
+        .map(filterToSplits(fileStatus, _, rootOption.get, filters, schema, hadoopConf))
+        .getOrElse(super.getSplits(fileIndex, fileStatus, filters, schema, hadoopConf))
+    }
+  }
+
+  private def filterToSplits(fileStatus: FileStatus,
+                             metadata: ParquetMetadata,
+                             metadataRoot: Path,
+                             filters: Seq[Filter],
+                             schema: StructType,
+                             hadoopConf: Configuration): Seq[FileSplit] = {
+    val metadataBlocks = metadata.getBlocks
+    val parquetSchema = metadata.getFileMetaData.getSchema
+    val filter = FilterCompat.get(filters
+      .flatMap(ParquetFilters.createFilter(schema, _))
+      .reduce(FilterApi.and))
+    val filteredMetadata =
+      RowGroupFilter.filterRowGroups(filter, metadataBlocks, parquetSchema).asScala
+    filteredMetadata.flatMap(bmd => {
+      val bmdPath = new Path(metadataRoot, bmd.getPath)
+      val fsPath = fileStatus.getPath
+      if (bmdPath == fsPath) {
+        Some(new FileSplit(bmdPath, bmd.getStartingPos, bmd.getTotalByteSize, Array.empty))
+      } else {
+        None
+      }
+    })
+  }
+
+  private def getMetadataForPath(filePath: Path,
--- End diff --

I think this should be cacheable, as the lifetime of a `FileFormat` is tied to a particular `HadoopFsRelation`, which in turn is tied to a `Dataset`. Something may have changed about the API, but from a quick look through the relevant classes I don't see anything to refute this. I don't see any written guarantee in the FileFormat API that concrete impls need to be stateless, but neither did I find any reference to caching/state in other implementations. Still, I think it should be safe to cache on the ParquetFileFormat for this use case.
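For illustration, a minimal sketch of what per-relation caching could look like as members of `ParquetFileFormat`; the cache field, method name, and the exists-check are my own, not part of this PR:

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.{ParquetFileReader, ParquetFileWriter}
import org.apache.parquet.hadoop.metadata.ParquetMetadata

// Hypothetical sketch: cache the parsed _metadata footer per root directory so
// repeated getSplits calls for files under the same root read it only once.
// Safe only if the FileFormat instance really is scoped to one HadoopFsRelation.
private val footerCache = new ConcurrentHashMap[Path, Option[ParquetMetadata]]()

private def cachedSummaryMetadata(rootPath: Path, conf: Configuration): Option[ParquetMetadata] =
  footerCache.computeIfAbsent(rootPath, new JFunction[Path, Option[ParquetMetadata]] {
    override def apply(root: Path): Option[ParquetMetadata] = {
      val fs = root.getFileSystem(conf)
      val dir = if (fs.getFileStatus(root).isDirectory) root else root.getParent
      val metadataFile = new Path(dir, ParquetFileWriter.PARQUET_METADATA_FILE)
      if (fs.exists(metadataFile)) {
        Some(ParquetFileReader.readFooter(conf, metadataFile, ParquetMetadataConverter.NO_FILTER))
      } else {
        None
      }
    }
  })
```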





[GitHub] spark pull request #15835: [SPARK-17059][SQL] Allow FileFormat to specify pa...

2016-11-10 Thread andreweduffy
Github user andreweduffy commented on a diff in the pull request:

https://github.com/apache/spark/pull/15835#discussion_r87499298
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -272,6 +275,83 @@ class ParquetFileFormat
         true
     }
 
+  override def getSplits(fileIndex: FileIndex,
+                         fileStatus: FileStatus,
+                         filters: Seq[Filter],
+                         schema: StructType,
+                         hadoopConf: Configuration): Seq[FileSplit] = {
+    if (filters.isEmpty) {
+      // Return immediately to save FileSystem overhead
+      super.getSplits(fileIndex, fileStatus, filters, schema, hadoopConf)
+    } else {
+      val filePath = fileStatus.getPath
+      val rootOption: Option[Path] = fileIndex.rootPaths
+        .find(root => filePath.toString.startsWith(root.toString))
+      val metadataOption = rootOption.flatMap(getMetadataForPath(filePath, _, hadoopConf))
+      // If the metadata exists, filter the splits.
+      // Otherwise, fall back to the default implementation.
+      metadataOption
+        .map(filterToSplits(fileStatus, _, rootOption.get, filters, schema, hadoopConf))
+        .getOrElse(super.getSplits(fileIndex, fileStatus, filters, schema, hadoopConf))
+    }
+  }
+
+  private def filterToSplits(fileStatus: FileStatus,
+                             metadata: ParquetMetadata,
+                             metadataRoot: Path,
+                             filters: Seq[Filter],
+                             schema: StructType,
+                             hadoopConf: Configuration): Seq[FileSplit] = {
+    val metadataBlocks = metadata.getBlocks
+    val parquetSchema = metadata.getFileMetaData.getSchema
+    val filter = FilterCompat.get(filters
+      .flatMap(ParquetFilters.createFilter(schema, _))
+      .reduce(FilterApi.and))
+    val filteredMetadata =
+      RowGroupFilter.filterRowGroups(filter, metadataBlocks, parquetSchema).asScala
+    filteredMetadata.flatMap(bmd => {
+      val bmdPath = new Path(metadataRoot, bmd.getPath)
+      val fsPath = fileStatus.getPath
+      if (bmdPath == fsPath) {
+        Some(new FileSplit(bmdPath, bmd.getStartingPos, bmd.getTotalByteSize, Array.empty))
+      } else {
+        None
+      }
+    })
+  }
+
+  private def getMetadataForPath(filePath: Path,
+                                 rootPath: Path,
+                                 conf: Configuration): Option[ParquetMetadata] = {
+    val fs = rootPath.getFileSystem(conf)
+    try {
+      val stat = fs.getFileStatus(rootPath)
+      // Mimic Parquet behavior. If given a directory, find the underlying _metadata file.
+      // If given a single file, check the parent directory for a _metadata file.
+      val directory = if (stat.isDirectory) stat.getPath else stat.getPath.getParent
+      val metadataFile = new Path(directory, ParquetFileWriter.PARQUET_METADATA_FILE)
+      val metadata =
+        ParquetFileReader.readFooter(conf, metadataFile, ParquetMetadataConverter.NO_FILTER)
+
+      // Ensure that the metadata has an entry for the file.
+      // If it does not, do not filter at this stage.
+      val matchingBlockExists = metadata.getBlocks.asScala.exists(bmd => {
--- End diff --

I'm a little unclear as to why this branch is necessary. Under what conditions do you have a Parquet dataset where one of its files isn't in the summary metadata? Is this the case where you write out the dataset with metadata, then add a new partition and write without updating the metadata?





[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-11-10 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14649
  
Yep yep, closing this one





[GitHub] spark pull request #14649: [SPARK-17059][SQL] Allow FileFormat to specify pa...

2016-11-10 Thread andreweduffy
Github user andreweduffy closed the pull request at:

https://github.com/apache/spark/pull/14649





[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-10-31 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14649
  
Hey @paulata, sounds like you want to make the pruner pluggable at runtime. That sounds like it would be good, though IMO it is a bit orthogonal to this change. This is more of an optimization than an FR, i.e. you're "no worse off" with this change performance-wise and all pre-existing behavior is preserved, whereas yours is a new feature.

You're welcome to reuse the code in this PR for your own needs, or to base your changes on top of it. I have my doubts that the PR in its current form will be merged into Spark, but it has already been merged into Palantir's fork (https://github.com/palantir/spark/pull/40), so hopefully the folks there will try to upstream it at some point in the future.





[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-09-27 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14671
  
In light of @HyukjinKwon's benchmark, it seems like Spark-side filtering is the right thing to do here, so I think this should be good?





[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-09-27 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14649
  
Glad that helped, sorry it wasn't clearer. Agreed that writing summary metadata isn't always the best. In this patch, the file pruning is only ever performed when the _metadata file exists for the dataset. At work we have it enabled, since we have a query-heavy workload where new data lands occasionally.

On Tue, Sep 27, 2016 at 10:13 AM -0700, "Cheng Lian" <notificati...@github.com> wrote:

> @andreweduffy Thanks for the explanations! This makes much more sense to me now.
>
> Although _metadata can be neat for the read path, it's a troublemaker for the write path:
>
> - Writing summary files (either _metadata or _common_metadata) can be quite expensive when
>   writing a large Parquet dataset, since it reads footers from all files and tries to merge
>   them. This can be especially frustrating when appending a small amount of data to an
>   existing large dataset.
> - Parquet doesn't always write the summary files even if you explicitly set
>   parquet.enable.summary-metadata to true. For example, when two files have different values
>   of a single key in the user-defined key/value metadata section, Parquet simply gives up
>   writing the summary files and deletes existing ones. This may be quite common in the case
>   of schema evolution. What makes it worse, outdated _common_metadata might not be deleted
>   properly due to PARQUET-359, which makes the summary files out of sync.
>
> However, I still agree that with an existing trustworthy _metadata file at hand, this patch
> is still very useful. I'll take a deeper look at this tomorrow.



[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-09-27 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14649
  
Hyun mostly sums it up. This uses the summary metadata for Parquet when available. Rather than performing row-group-level filtering, it actually filters out entire files when summary metadata is available. It does this while constructing the FileScanRDD, which means it only spawns tasks for files that match the predicate. At work we were running into issues with S3 deployments where very large S3 datasets would take exceedingly long to load in Spark. Empirically, we're running this exact patch in production, and for many types of queries we see a very large decrease in tasks created and time spent fetching from S3. So this is mainly for the use case of short-lived RDDs (so doing .persist doesn't help you) that are backed by data in S3 (so eliminating read time is actually a significant speedup).





[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-09-06 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14671
  
@davies Row-level filtering doesn't occur with the vectorized reader, which is now enabled by default.
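For anyone wanting to check this locally, a small sketch; the path is a placeholder, and `spark.sql.parquet.enableVectorizedReader` is the standard Spark SQL key for toggling the vectorized reader:

```scala
// Disable the vectorized reader to fall back to parquet-mr's
// ParquetRecordReader, which does perform record-level filtering.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
val df = spark.read.parquet("/path/to/data").where("x IN (1, 2, 3)")
df.count() // with the vectorized reader off, row-level pushdown applies
```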





[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-09-06 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14671
  
Cool. Ping @davies @cloud-fan: would either of you be able to look at this?





[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-08-30 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14649
  
Addressed typos.





[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-30 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14671
  
@HyukjinKwon would you like to file a separate ticket for benchmarking? It's pretty orthogonal to this PR; see rdblue's comment above.





[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-22 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14671
  
cc @davies @cloud-fan for the Parquet change; seems I got @rdblue's stamp of approval.





[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-08-22 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14649
  
cc @davies @cloud-fan as well





[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-18 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14671
  
@rxin Did you get the chance to take a closer look at this?





[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-17 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14671
  
Yeah, benchmarking is definitely a great idea, as it is likely Spark will be better than Parquet at filtering individual records, but I'm still not quite understanding why this filter is any different and should block on a row-by-row filtering decision. _All_ filters are processed row-by-row by ParquetRecordReader, to my understanding, and this one is no different from any of the others in ParquetFilters.





[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-17 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14671
  
That is true, but currently all filters are pushed down to row-by-row filtering anyway when the vectorized reader is not in use, so I'm unclear why the IN filter is special.





[GitHub] spark pull request #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to O...

2016-08-17 Thread andreweduffy
Github user andreweduffy commented on a diff in the pull request:

https://github.com/apache/spark/pull/14671#discussion_r75092317
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -369,7 +369,7 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext
 
   test("SPARK-11103: Filter applied on merged Parquet schema with new column fails") {
     import testImplicits._
-    Seq("true", "false").map { vectorized =>
+    Seq("true", "false").foreach { vectorized =>
--- End diff --

Yep, maybe this belongs in a separate PR as a followup to the previous one; I just changed it because IntelliJ was yelling at me :P





[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-17 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14671
  
Thanks for the comments guys! Had to search through some code, but I think I understand the current state of things. Correct me if I'm wrong, but it seems that record-by-record filtering only occurs if the vectorized reader is disabled, as there is no logic in SpecificParquetRecordReaderBase to perform individual record filtering; so currently this will only happen if we fall back to the Parquet-provided ParquetRecordReader.

After digging into ParquetRecordReader, I found that it pushes down the "record-by-record" filter to InternalParquetRecordReader. However, it appears that inside InternalParquetRecordReader you can actually disable row-by-row filtering with the magic conf `parquet.filter.record-level.enabled` ([declaration](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L121) and [usage](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L180)).

So in theory, if we set this configuration to false when we construct ParquetRecordReader in [ParquetFileFormat](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L381), we wouldn't have to worry about row-by-row filtering being applied along either code path (I think).
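In other words, something like the following when building the Hadoop conf handed to the reader (a sketch only; the key is the one linked above, corresponding to ParquetInputFormat.RECORD_FILTERING_ENABLED):

```scala
import org.apache.hadoop.conf.Configuration

val hadoopConf = new Configuration()
// Keep row-group (statistics) filtering but skip Parquet's record-by-record
// filtering; Spark would then re-apply the predicate on its side.
hadoopConf.setBoolean("parquet.filter.record-level.enabled", false)
```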





[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-16 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14671
  
cc @HyukjinKwon @rdblue for Parquet-related change





[GitHub] spark pull request #14671: SPARK-17091: ParquetFilters rewrite IN to OR of E...

2016-08-16 Thread andreweduffy
Github user andreweduffy commented on a diff in the pull request:

https://github.com/apache/spark/pull/14671#discussion_r75008463
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -40,20 +40,20 @@ import org.apache.spark.util.{AccumulatorContext, LongAccumulator}
  * NOTE:
  *
  * 1. `!(a cmp b)` is always transformed to its negated form `a cmp' b` by the
- *    `BooleanSimplification` optimization rule whenever possible. As a result, predicate `!(a < 1)`
- *    results in a `GtEq` filter predicate rather than a `Not`.
+ *    `BooleanSimplification` optimization rule whenever possible. As a result, predicate
+ *    `!(a < 1)` results in a `GtEq` filter predicate rather than a `Not`.
  *
  * 2. `Tuple1(Option(x))` is used together with `AnyVal` types like `Int` to ensure the inferred
  *    data type is nullable.
  */
 class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext {
 
   private def checkFilterPredicate(
-      df: DataFrame,
-      predicate: Predicate,
-      filterClass: Class[_ <: FilterPredicate],
-      checker: (DataFrame, Seq[Row]) => Unit,
-      expected: Seq[Row]): Unit = {
+    df: DataFrame,
--- End diff --

Sorry about the indentation fixes; I can revert all of these.





[GitHub] spark pull request #14671: SPARK-17091: ParquetFilters rewrite IN to OR of E...

2016-08-16 Thread andreweduffy
GitHub user andreweduffy opened a pull request:

https://github.com/apache/spark/pull/14671

SPARK-17091: ParquetFilters rewrite IN to OR of Eq

## What changes were proposed in this pull request?

Allow pushdown of `IN` clauses. Previous implementations relied upon custom user-defined predicates in Parquet; instead, here we just convert an IN over a set into an OR over a set of equality expressions, which can be pushed down properly to Parquet.
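Conceptually, the rewrite is just the following (a sketch over Spark's `sources.Filter` types; the actual change lives in ParquetFilters and emits Parquet `FilterPredicate`s):

```scala
import org.apache.spark.sql.sources

// IN (v1, ..., vn) becomes (= v1) OR ... OR (= vn), which Parquet can
// evaluate against row-group statistics. Assumes a non-empty value list.
def rewriteIn(filter: sources.Filter): sources.Filter = filter match {
  case sources.In(attribute, values) if values.nonEmpty =>
    values.map(v => sources.EqualTo(attribute, v): sources.Filter)
      .reduceLeft(sources.Or(_, _))
  case other => other
}
```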

## How was this patch tested?

Unit tests from previous PRs, specifically #10278. They pass with the change and fail when the case block is commented out, indicating the pushdown is successfully being applied in Parquet. Because it is a disjunction of equality checks, this should be applied at the row-group level.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/andreweduffy/spark pushdown-in

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14671.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14671


commit 7679285bceedb41c7d6390c069f0a852804c8cf3
Author: Andrew Duffy <r...@aduffy.org>
Date:   2016-08-16T17:57:15Z

SPARK-17091: ParquetFilters rewrite IN to OR of Eq







[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-08-16 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14649
  
(Thanks Sean for whitelisting!)





[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-08-16 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14649
  
cc @rxin @liancheng Let me know what you think





[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-08-16 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14649
  
Can we rerun tests please? Issue should be fixed now.





[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-08-16 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14649
  
Yep, that's it. Effectively it just reduces task overhead for querying large Parquet datasets and turns filter queries where the filter never applies into a no-op: no tasks generated (only if summary metadata is available).





[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...

2016-08-16 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14649
  
Let me see if I can answer these in order:

> So, if my understanding is correct, Parquet filters rowgroups in both 
normal reader and vectorized reader already (#13701). Is this doing the same 
thing in Spark-side?

Yep, this attempts to do Footer processing before launching tasks, if the 
_metadata file is available.

> Also, doesn't this try to touch many files in driver-side?

This should only be touching the _metadata file; it reads all the Footers that are collected in there to determine which file paths to prune out. It needs the _metadata and not the _common_metadata because _common_metadata doesn't include footer info.

> Also, if my understanding is correct, we are picking up only single file 
to read footer (see ParquetFileFormat.scala#L217-L225) unless we merge schemas. 
So, it seems, due to this reason, writing _metadata or _common_metadata is 
disabled (See https://issues.apache.org/jira/browse/SPARK-15719).

Yes, we have generation of the summary metadata disabled by default, but it is configurable via the `parquet.enable.summary-metadata` conf parameter. If the _metadata file isn't found, it simply falls back to the original partitioning structure and does no file pruning.
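For reference, a minimal way to opt back in on the write side (the DataFrame and path are placeholders; the key is the `parquet.enable.summary-metadata` parameter mentioned above):

```scala
// Re-enable summary metadata so _metadata / _common_metadata get written
// alongside the Parquet files.
spark.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "true")
df.write.parquet("/path/to/dataset")
```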





[GitHub] spark pull request #14649: [SPARK-17059][SQL] Allow FileFormat to specify pa...

2016-08-15 Thread andreweduffy
Github user andreweduffy commented on a diff in the pull request:

https://github.com/apache/spark/pull/14649#discussion_r74860656
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -703,6 +703,16 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       }
     }
   }
+
+  test("SPARK-17059: Allow Allow FileFormat to specify partition pruning strategy") {
--- End diff --

Typo, will fix.





[GitHub] spark pull request #14649: [SPARK-17059][SQL] Allow FileFormat to specify pa...

2016-08-15 Thread andreweduffy
GitHub user andreweduffy opened a pull request:

https://github.com/apache/spark/pull/14649

[SPARK-17059][SQL] Allow FileFormat to specify partition pruning strategy

## What changes were proposed in this pull request?

Different FileFormat implementations may be able to make intelligent decisions about which files need to be processed as part of a FileSourceScanExec, based on the `pushedFilters` available. This PR implements a way to do that pluggably, with an implementation for Parquet that uses the summary metadata to prevent task creation.

This has a few performance benefits:

1. Reading of files is generally slow, especially for S3. In the current Parquet implementation the summary metadata is not used, so the footers are read directly. This can be very slow for large Parquet datasets, as even as of Hadoop 2.7, S3 reads fetch the entire file by default (random reads are configurable only from 2.7 onwards, and are disabled by default).
2. Partitions that are found to contain no files can then be pruned out or coalesced, depending on the FileFormat's implementation, so fewer tasks are created (see the sketch below).
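A sketch of the observable effect (placeholder path; assumes summary metadata is enabled on the write side):

```scala
spark.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "true")
spark.range(0, 100).toDF("x").write.parquet("/data/example")

// With _metadata available, a predicate that matches no row groups prunes
// every file before any tasks are created.
val none = spark.read.parquet("/data/example").where("x < 0")
assert(none.rdd.partitions.isEmpty)
```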

## How was this patch tested?

Existing tests and Spark Shell, plus unit test for the Parquet 
implementation.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/andreweduffy/spark parquet-task-pruning

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14649.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14649


commit 46657980c00ce36c03e39540bd4399613f439e04
Author: Andrew Duffy <r...@aduffy.org>
Date:   2016-07-27T16:47:20Z

Allow FileFormat to specify file pruning







[GitHub] spark issue #14159: [SQL][PARQUET] Fix for Vectorized Parquet filter pushdow...

2016-08-10 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14159
  
Yep, closing now





[GitHub] spark pull request #14159: [SQL][PARQUET] Fix for Vectorized Parquet filter ...

2016-08-10 Thread andreweduffy
Github user andreweduffy closed the pull request at:

https://github.com/apache/spark/pull/14159





[GitHub] spark issue #14159: [SQL][PARQUET] Fix for Vectorized Parquet filter pushdow...

2016-08-10 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14159
  
Actually, it appears that since this was opened, the later PR #14450 fixes this. Should be safe to close now.





[GitHub] spark issue #14201: [SPARK-14702] Make environment of SparkLauncher launched...

2016-07-19 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14201
  
Thanks @vanzin !





[GitHub] spark issue #14201: [SPARK-14702] Make environment of SparkLauncher launched...

2016-07-18 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14201
  
Yuck, sorry: switching between projects, my style settings were carried over. That should be it for indentation fixes.





[GitHub] spark issue #14159: [SQL][PARQUET] Fix for Vectorized Parquet filter pushdow...

2016-07-18 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14159
  
@hvanhovell Anything new on this front?





[GitHub] spark pull request #13953: [SPARK-16265][SUBMIT] Addition of --driver-jre Fl...

2016-07-18 Thread andreweduffy
Github user andreweduffy closed the pull request at:

https://github.com/apache/spark/pull/13953





[GitHub] spark issue #14201: [SPARK-14702] Make environment of SparkLauncher launched...

2016-07-15 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14201
  
Gotcha, sorry about the force overwrites; in some projects I've worked on that's preferred to keep a clean history, though for review I can see why separate units are easier. Thanks again for taking the time to review this.





[GitHub] spark pull request #14201: [SPARK-14702] Make environment of SparkLauncher l...

2016-07-15 Thread andreweduffy
Github user andreweduffy commented on a diff in the pull request:

https://github.com/apache/spark/pull/14201#discussion_r71036075
  
--- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java ---
@@ -359,6 +369,83 @@ public SparkLauncher setVerbose(boolean verbose) {
   }
 
   /**
+   * Sets the working directory of spark-submit.
+   *
+   * @param dir The directory to set as spark-submit's working directory.
+   * @return This launcher.
+   */
+  public SparkLauncher directory(File dir) {
+    workingDir = dir;
+    return this;
+  }
+
+  /**
+   * Specifies that stderr in spark-submit should be redirected to stdout.
+   *
+   * @return This launcher.
+   */
+  public SparkLauncher redirectError() {
+    redirectErrorStream = true;
+    return this;
+  }
+
+  /**
+   * Redirects error output to the specified Redirect.
+   *
+   * @param to The method of redirection.
+   * @return This launcher.
+   */
+  public SparkLauncher redirectError(ProcessBuilder.Redirect to) {
+    errorStream = to;
+    return this;
+  }
+
+  /**
+   * Redirects standard output to the specified Redirect.
+   *
+   * @param to The method of redirection.
+   * @return This launcher.
+   */
+  public SparkLauncher redirectOutput(ProcessBuilder.Redirect to) {
+    outputStream = to;
+    return this;
+  }
+
+  /**
+   * Redirects error output to the specified File.
+   *
+   * @param errFile The file to which stderr is written.
+   * @return This launcher.
+   */
+  public SparkLauncher redirectError(File errFile) {
+    errorStream = ProcessBuilder.Redirect.to(errFile);
+    return this;
+  }
+
+  /**
+   * Redirects standard output to the specified File.
+   *
+   * @param outFile The file to which stdout is written.
+   * @return This launcher.
+   */
+  public SparkLauncher redirectOutput(File outFile) {
+    outputStream = ProcessBuilder.Redirect.to(outFile);
+    return this;
+  }
+
+  /**
+   * Sets all output to be logged and redirected to a logger with the specified name.
+   *
+   * @param loggerName The name of the logger to log stdout and stderr.
+   * @return This launcher.
+   */
+  public SparkLauncher redirectToLog(String loggerName) {
--- End diff --

Yep, that's what I'm saying with the second point above, so those two should probably be good enough.





[GitHub] spark pull request #14201: [SPARK-14702] Make environment of SparkLauncher l...

2016-07-15 Thread andreweduffy
Github user andreweduffy commented on a diff in the pull request:

https://github.com/apache/spark/pull/14201#discussion_r71031747
  
--- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java ---
@@ -359,6 +369,83 @@ public SparkLauncher setVerbose(boolean verbose) {
   }
 
   /**
+   * Sets the working directory of spark-submit.
+   *
+   * @param dir The directory to set as spark-submit's working directory.
+   * @return This launcher.
+   */
+  public SparkLauncher directory(File dir) {
+    workingDir = dir;
+    return this;
+  }
+
+  /**
+   * Specifies that stderr in spark-submit should be redirected to stdout.
+   *
+   * @return This launcher.
+   */
+  public SparkLauncher redirectError() {
+    redirectErrorStream = true;
+    return this;
+  }
+
+  /**
+   * Redirects error output to the specified Redirect.
+   *
+   * @param to The method of redirection.
+   * @return This launcher.
+   */
+  public SparkLauncher redirectError(ProcessBuilder.Redirect to) {
+    errorStream = to;
+    return this;
+  }
+
+  /**
+   * Redirects standard output to the specified Redirect.
+   *
+   * @param to The method of redirection.
+   * @return This launcher.
+   */
+  public SparkLauncher redirectOutput(ProcessBuilder.Redirect to) {
+    outputStream = to;
+    return this;
+  }
+
+  /**
+   * Redirects error output to the specified File.
+   *
+   * @param errFile The file to which stderr is written.
+   * @return This launcher.
+   */
+  public SparkLauncher redirectError(File errFile) {
+    errorStream = ProcessBuilder.Redirect.to(errFile);
+    return this;
+  }
+
+  /**
+   * Redirects standard output to the specified File.
+   *
+   * @param outFile The file to which stdout is written.
+   * @return This launcher.
+   */
+  public SparkLauncher redirectOutput(File outFile) {
+    outputStream = ProcessBuilder.Redirect.to(outFile);
+    return this;
+  }
+
+  /**
+   * Sets all output to be logged and redirected to a logger with the specified name.
+   *
+   * @param loggerName The name of the logger to log stdout and stderr.
+   * @return This launcher.
+   */
+  public SparkLauncher redirectToLog(String loggerName) {
--- End diff --

So I had an extra check in the last commit, but I thought there was a comment that it should be removed; maybe I should just clarify here. I see two possible checks that we should be doing here:

* Ensure that only one of `redirectError(...)` and `redirectError()` can be 
used (done)
* Ensure that `redirectToLog()` is never used with any other `redirect*` 
method (removed in this diff, can be added back)

Are there others that I'm missing?





[GitHub] spark issue #14159: [SQL][PARQUET] Fix for Vectorized Parquet filter pushdow...

2016-07-15 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14159
  
bump?





[GitHub] spark pull request #14201: [SPARK-14702] Make environment of SparkLauncher l...

2016-07-14 Thread andreweduffy
Github user andreweduffy commented on a diff in the pull request:

https://github.com/apache/spark/pull/14201#discussion_r70905418
  
--- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java ---
@@ -359,6 +364,82 @@ public SparkLauncher setVerbose(boolean verbose) {
   }
 
   /**
+   * Sets the working directory of the driver process.
+   * @param dir The directory to set as the driver's working directory.
+   * @return This launcher.
+   */
+  public SparkLauncher directory(File dir) {
+    builder.workingDir = dir;
+    return this;
+  }
+
+  /**
+   * Specifies that stderr in the driver should be redirected to stdout.
+   * @return This launcher.
+   */
+  public SparkLauncher redirectError() {
+    builder.redirectErrorStream = true;
+    return this;
+  }
+
+  /**
+   * Redirects error output to the specified Redirect.
+   * @param to The method of redirection.
+   * @return This launcher.
+   */
+  public SparkLauncher redirectError(ProcessBuilder.Redirect to) {
+    builder.errorStream = to;
+    return this;
+  }
+
+  /**
+   * Redirects standard output to the specified Redirect.
+   * @param to The method of redirection.
+   * @return This launcher.
+   */
+  public SparkLauncher redirectOutput(ProcessBuilder.Redirect to) {
+    builder.outputStream = to;
+    return this;
+  }
+
+  /**
+   * Redirects error output to the specified File.
+   * @param errFile The file to which stderr is written.
+   * @return This launcher.
+   */
+  public SparkLauncher redirectError(File errFile) {
+    builder.errorStream = ProcessBuilder.Redirect.to(errFile);
+    return this;
+  }
+
+  /**
+   * Redirects standard output to the specified File.
+   * @param outFile The file to which stdout is written.
+   * @return This launcher.
+   */
+  public SparkLauncher redirectOutput(File outFile) {
+    builder.outputStream = ProcessBuilder.Redirect.to(outFile);
+    return this;
+  }
+
+  /**
+   * Sets all output to be logged and redirected to a logger with the specified name.
+   * @param loggerName The name of the logger to log stdout and stderr.
+   * @return This launcher.
+   */
+  public SparkLauncher redirectToLog(String loggerName) {
+    try {
+      // NOTE: the below ordering is important, so builder.redirectToLog
+      // is only set to true iff the preceding put() finishes without exception.
+      builder.getEffectiveConfig().put(CHILD_PROCESS_LOGGER_NAME, loggerName);
--- End diff --

Should I also modify `startApplication` to read from builder.conf? It appears to use `builder.getEffectiveConfig()`, which as far as I can tell is sourced from a properties file.





[GitHub] spark pull request #14201: [SPARK-14702] Make environment of SparkLauncher l...

2016-07-14 Thread andreweduffy
Github user andreweduffy commented on a diff in the pull request:

https://github.com/apache/spark/pull/14201#discussion_r70904921
  
--- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java ---
@@ -82,8 +83,12 @@
   /** Used internally to create unique logger names. */
   private static final AtomicInteger COUNTER = new AtomicInteger();
 
+  public static final ThreadFactory REDIRECTOR_FACTORY =
+      new NamedThreadFactory("launcher-proc-%d");
--- End diff --

How do you think this should be shared without making it public static in either SparkLauncher or ChildProcAppHandle?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14201: [SPARK-14702] Make environment of SparkLauncher l...

2016-07-14 Thread andreweduffy
Github user andreweduffy commented on a diff in the pull request:

https://github.com/apache/spark/pull/14201#discussion_r70888965
  
--- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java ---
@@ -82,8 +83,12 @@
   /** Used internally to create unique logger names. */
   private static final AtomicInteger COUNTER = new AtomicInteger();
 
+  public static final ThreadFactory REDIRECTOR_FACTORY =
+      new NamedThreadFactory("launcher-proc-%d");
+
   static final Map<String, String> launcherConfig = new HashMap<>();
 
+  private OutputRedirector redirector; // Holds the possible OutputRedirector
--- End diff --

Alright, that was a concern of mine; I'll remove it in the next patch.





[GitHub] spark issue #14201: [SPARK-14702] Expose SparkLauncher's ProcessBuilder for ...

2016-07-14 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14201
  
I'll go ahead and replace the use of Throwables with `throw new 
RuntimeException`, though it's not clear why it can't find Guava on the 
classpath.
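
The swap itself is mechanical; a self-contained sketch (the `launch` call is 
a stand-in for whatever launcher code currently throws):

    import java.io.IOException;

    public class PropagateSwapSketch {
      public static void main(String[] args) {
        try {
          launch();  // stand-in for the call that can throw
        } catch (IOException e) {
          // Was: throw Throwables.propagate(e);  (needs Guava on the classpath)
          throw new RuntimeException(e);  // plain-JDK equivalent
        }
      }

      private static void launch() throws IOException {
        // placeholder body
      }
    }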





[GitHub] spark pull request #14185: [SPARK-14702][SUBMIT] Expose SparkLauncher's Proc...

2016-07-14 Thread andreweduffy
Github user andreweduffy closed the pull request at:

https://github.com/apache/spark/pull/14185





[GitHub] spark issue #14201: [SPARK-14702] Expose SparkLauncher's ProcessBuilder for ...

2016-07-14 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14201
  
Replaces #14185, @vanzin 





[GitHub] spark pull request #14201: [SPARK-14702] Expose SparkLauncher's ProcessBuild...

2016-07-14 Thread andreweduffy
GitHub user andreweduffy opened a pull request:

https://github.com/apache/spark/pull/14201

[SPARK-14702] Expose SparkLauncher's ProcessBuilder for user flexibility

## What changes were proposed in this pull request?

Adds a few public methods to `SparkLauncher` for configuring extra features 
of the underlying `ProcessBuilder`, including the working directory and 
output/error stream redirection. A usage sketch follows below.
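
A minimal usage sketch of the proposed surface (the `directory` method name 
is assumed from the description above, and every path is a placeholder):

    import java.io.File;
    import org.apache.spark.launcher.SparkLauncher;

    public class LauncherRedirectExample {
      public static void main(String[] args) throws Exception {
        Process app = new SparkLauncher()
            .setAppResource("/path/to/app.jar")           // placeholder
            .setMainClass("com.example.MyApp")            // placeholder
            .setMaster("local[2]")
            .directory(new File("/tmp/spark-workdir"))    // child working directory
            .redirectOutput(new File("/tmp/stdout.log"))  // stdout -> file
            .redirectError(new File("/tmp/stderr.log"))   // stderr -> file
            .launch();
        System.exit(app.waitFor());  // propagate the child's exit code
      }
    }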

## How was this patch tested?

Unit testing + simple Spark driver programs

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/andreweduffy/spark feature/launcher

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14201.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14201


commit 5a35963bc4661193d8c169c7f98fffe24b181124
Author: Andrew Duffy <adu...@palantir.com>
Date:   2016-07-14T11:57:15Z

Make environment of SparkLauncher launched process more configurable







[GitHub] spark issue #14185: [SPARK-14702][SUBMIT] Expose SparkLauncher's ProcessBuil...

2016-07-14 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14185
  
That sounds reasonable; I can take a shot at that today. I'll submit it as 
a second PR and then we can close this one out.





[GitHub] spark pull request #14185: [SPARK-16511][SUBMIT] Expose SparkLauncher's Proc...

2016-07-13 Thread andreweduffy
GitHub user andreweduffy opened a pull request:

https://github.com/apache/spark/pull/14185

[SPARK-16511][SUBMIT] Expose SparkLauncher's ProcessBuilder for user flexibility

## What changes were proposed in this pull request?

Expose the `ProcessBuilder` in `SparkLauncher`, and open up 
`SparkLauncher#startApplication` to accept arbitrary `ProcessBuilder`s.

For more info see [this JIRA ticket](https://issues.apache.org/jira/browse/SPARK-16511).

Also covers [SPARK-14702](https://issues.apache.org/jira/browse/SPARK-14702).
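
A sketch of the shape this could take; both launcher method names below are 
hypothetical illustrations of the idea, not signatures confirmed by this 
patch, and all paths are placeholders:

    import java.io.File;
    import org.apache.spark.launcher.SparkAppHandle;
    import org.apache.spark.launcher.SparkLauncher;

    public class ExposedProcessBuilderSketch {
      public static void main(String[] args) throws Exception {
        SparkLauncher launcher = new SparkLauncher()
            .setAppResource("/path/to/app.jar")   // placeholder
            .setMainClass("com.example.MyApp");   // placeholder
        ProcessBuilder pb = launcher.createProcessBuilder();   // hypothetical accessor
        pb.directory(new File("/tmp/workdir"));
        pb.redirectErrorStream(true);                          // merge stderr into stdout
        SparkAppHandle handle = launcher.startApplication(pb); // hypothetical overload
        while (!handle.getState().isFinal()) {
          Thread.sleep(500);
        }
      }
    }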

## How was this patch tested?

Unit tests + spark shell




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/andreweduffy/spark feature/processbuilder

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14185.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14185


commit 7fe36f5970e7e577a47d8b6a7534cc95d22a94c2
Author: Andrew Duffy <adu...@palantir.com>
Date:   2016-07-13T14:20:54Z

[SPARK-16511][SUBMIT] Expose SparkLauncher's ProcessBuilder for user
flexibility

Also covers https://issues.apache.org/jira/browse/SPARK-16511







[GitHub] spark issue #14159: [SQL][PARQUET] Fix for Vectorized Parquet filter pushdow...

2016-07-13 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/14159
  
Yep, looks like the other one was closed by the committer. I saw Sean 
commented that this might need to be tested against 2.2; is that going to be 
necessary?





[GitHub] spark pull request #14159: [PARQUET] Fix for Parquet filter pushdown

2016-07-12 Thread andreweduffy
GitHub user andreweduffy opened a pull request:

https://github.com/apache/spark/pull/14159

[PARQUET] Fix for Parquet filter pushdown

## What changes were proposed in this pull request?

Fix Parquet filter pushdown so that filters reach all the way down to the 
file level.

Use of the previous (deprecated) constructor defaults to null metadata, which
prevents pushdown from reaching the Parquet level.

## How was this patch tested?

Looking at the output of collects from the Spark shell: before, warnings 
about CorruptStatistics were printed and filters were not pushed down to 
individual Parquet files. Now the metadata in each file is used for pushdown. 
(A quick verification sketch follows.)
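
One quick way to eyeball pushdown end to end (a sketch against the Java API; 
the path and column name are placeholders):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class PushdownCheck {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[2]").appName("pushdown-check").getOrCreate();
        Dataset<Row> df = spark.read().parquet("/path/to/data").filter("x > 5");
        // The physical plan should list the predicate under PushedFilters;
        // with the bug, CorruptStatistics warnings appear instead and
        // row-group filtering is lost.
        df.explain(true);
        spark.stop();
      }
    }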



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/andreweduffy/spark bugfix/pushdown

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14159.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14159


commit f825ad709cdc3c89d0cc7e41d0410998e6cc7541
Author: Andrew Duffy <adu...@palantir.com>
Date:   2016-07-12T19:41:22Z

Fix for Parquet filter pushdown

Use of previous deprecated constructor defaults to null metadata, which
prevents pushdown from reaching the Parquet level.







[GitHub] spark issue #13953: [SPARK-16265][SUBMIT] Addition of --driver-jre Flag to S...

2016-07-01 Thread andreweduffy
Github user andreweduffy commented on the issue:

https://github.com/apache/spark/pull/13953
  
@petermaxlee can you give an example of how it can "get into a lot of 
problems"? The use case seems weird if you're usually developing on hardware 
that you control, but makes more sense if you're developing a Spark application 
and then want to give it to someone to run on their own cluster. It integrates 
with `SparkLauncher` so that you can spawn cluster jobs that use the correct 
version of Java without forcing the recipient of the app to futz with their 
cluster.





[GitHub] spark pull request #13953: [SPARK-16265][SUBMIT] Addition of --driver-jre Fl...

2016-06-28 Thread andreweduffy
GitHub user andreweduffy opened a pull request:

https://github.com/apache/spark/pull/13953

[SPARK-16265][SUBMIT] Addition of --driver-jre Flag to SparkSubmit

## What changes were proposed in this pull request?

1. Addition of a `--driver-jre` flag to `SparkSubmitArguments`.
2. When `--driver-jre` is specified to `SparkSubmit` and it is connecting to 
a YARN master, the driver zips up a copy of its in-use JRE and ships it to 
the YARN cluster as part of `spark.yarn.dist.archives`. Then, by setting 
`spark.executorEnv.JAVA_HOME` and `spark.appMasterEnv.JAVA_HOME` to point to 
the local copies, specific Java versions can be used even on YARN clusters 
that do not have those versions pre-installed (see the sketch below).
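
Roughly the manual equivalent of what the flag automates, sketched with 
`SparkLauncher` (the JRE archive must be staged ahead of time; config key 
names are taken from the description above, and every path is a placeholder):

    import org.apache.spark.launcher.SparkLauncher;

    public class ShippedJreSketch {
      public static void main(String[] args) throws Exception {
        Process app = new SparkLauncher()
            .setMaster("yarn")
            .setAppResource("/path/to/app.jar")  // placeholder
            // Ship a pre-staged JRE archive, unpacked under the alias "jre".
            .setConf("spark.yarn.dist.archives", "hdfs:///jres/jre8.zip#jre")
            // Point the executors and the AM at the shipped JRE.
            .setConf("spark.executorEnv.JAVA_HOME", "./jre/jre1.8.0")
            .setConf("spark.appMasterEnv.JAVA_HOME", "./jre/jre1.8.0")
            .launch();
        System.exit(app.waitFor());
      }
    }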

## How was this patch tested?

Compiled the Spark examples with Java 8 and did a Java 8 `spark-submit 
--driver-jre` of `SparkPi` to a Java 7 YARN cluster; the executors and AM 
used Java 8 without issue. Running without shipping the JRE results in errors 
due to Java version differences.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/andreweduffy/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13953.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13953


commit abd80d1bacd66eea6dfeb33fde3caa34bd7c9c7d
Author: Andrew Duffy <r...@aduffy.org>
Date:   2016-06-28T10:44:46Z

Addition of --driver-jre Flag to SparkSubmit



