[GitHub] spark pull request #15835: [SPARK-17059][SQL] Allow FileFormat to specify pa...
Github user andreweduffy commented on a diff in the pull request: https://github.com/apache/spark/pull/15835#discussion_r87503167 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala --- @@ -703,6 +703,34 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext } } } + + test("SPARK-17059: Allow FileFormat to specify partition pruning strategy") { +withSQLConf(ParquetOutputFormat.ENABLE_JOB_SUMMARY -> "true") { + withTempPath { path => +spark.sparkContext.parallelize(Seq(1, 2, 3), 3) + .toDF("x").write.parquet(path.getCanonicalPath) + +val zeroPartitions = spark.read.parquet(path.getCanonicalPath).where("x = 0") +assert(zeroPartitions.rdd.partitions.length == 0) + +val onePartition = spark.read.parquet(path.getCanonicalPath).where("x = 1") +assert(onePartition.rdd.partitions.length == 1) + } +} + } + + test("Do not filter out parquet file when missing in _metadata file") { --- End diff -- I'm not sure I understand what this test accomplishes. If you write out twice with summary metadata, why would anything be missing in `_metadata` file? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15835: [SPARK-17059][SQL] Allow FileFormat to specify pa...
Github user andreweduffy commented on a diff in the pull request: https://github.com/apache/spark/pull/15835#discussion_r87500313 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -272,6 +275,83 @@ class ParquetFileFormat true } + override def getSplits(fileIndex: FileIndex, + fileStatus: FileStatus, + filters: Seq[Filter], + schema: StructType, + hadoopConf: Configuration): Seq[FileSplit] = { +if (filters.isEmpty) { + // Return immediately to save FileSystem overhead + super.getSplits(fileIndex, fileStatus, filters, schema, hadoopConf) +} else { + val filePath = fileStatus.getPath + val rootOption: Option[Path] = fileIndex.rootPaths +.find(root => filePath.toString.startsWith(root.toString)) + val metadataOption = rootOption.flatMap(getMetadataForPath(filePath, _, hadoopConf)) + // If the metadata exists, filter the splits. + // Otherwise, fall back to the default implementation. + metadataOption +.map(filterToSplits(fileStatus, _, rootOption.get, filters, schema, hadoopConf)) +.getOrElse(super.getSplits(fileIndex, fileStatus, filters, schema, hadoopConf)) +} + } + + private def filterToSplits(fileStatus: FileStatus, + metadata: ParquetMetadata, + metadataRoot: Path, + filters: Seq[Filter], + schema: StructType, + hadoopConf: Configuration): Seq[FileSplit] = { +val metadataBlocks = metadata.getBlocks +val parquetSchema = metadata.getFileMetaData.getSchema +val filter = FilterCompat.get(filters + .flatMap(ParquetFilters.createFilter(schema, _)) + .reduce(FilterApi.and)) +val filteredMetadata = + RowGroupFilter.filterRowGroups(filter, metadataBlocks, parquetSchema).asScala +filteredMetadata.flatMap(bmd => { + val bmdPath = new Path(metadataRoot, bmd.getPath) + val fsPath = fileStatus.getPath + if (bmdPath == fsPath) { +Some(new FileSplit(bmdPath, bmd.getStartingPos, bmd.getTotalByteSize, Array.empty)) + } else { +None + } +}) + } + + private def getMetadataForPath(filePath: Path, --- End diff -- I think this should be cacheable, as the lifetime of a `FileFormat` is tied to a particular `HadoopFsRelation`, which again is tied to a `Dataset`. Something may have changed about the API but from a quick look through the relevant classes I don't see anything to refute this. I don't see any written guarantees in the FileFormat API that concrete impls need to be stateless, but neither did I find any reference to caching/state in other implementations. Still I think it should be safe to cache on the ParquetFileFormat for this use case. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15835: [SPARK-17059][SQL] Allow FileFormat to specify pa...
Github user andreweduffy commented on a diff in the pull request: https://github.com/apache/spark/pull/15835#discussion_r87499298 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -272,6 +275,83 @@ class ParquetFileFormat true } + override def getSplits(fileIndex: FileIndex, + fileStatus: FileStatus, + filters: Seq[Filter], + schema: StructType, + hadoopConf: Configuration): Seq[FileSplit] = { +if (filters.isEmpty) { + // Return immediately to save FileSystem overhead + super.getSplits(fileIndex, fileStatus, filters, schema, hadoopConf) +} else { + val filePath = fileStatus.getPath + val rootOption: Option[Path] = fileIndex.rootPaths +.find(root => filePath.toString.startsWith(root.toString)) + val metadataOption = rootOption.flatMap(getMetadataForPath(filePath, _, hadoopConf)) + // If the metadata exists, filter the splits. + // Otherwise, fall back to the default implementation. + metadataOption +.map(filterToSplits(fileStatus, _, rootOption.get, filters, schema, hadoopConf)) +.getOrElse(super.getSplits(fileIndex, fileStatus, filters, schema, hadoopConf)) +} + } + + private def filterToSplits(fileStatus: FileStatus, + metadata: ParquetMetadata, + metadataRoot: Path, + filters: Seq[Filter], + schema: StructType, + hadoopConf: Configuration): Seq[FileSplit] = { +val metadataBlocks = metadata.getBlocks +val parquetSchema = metadata.getFileMetaData.getSchema +val filter = FilterCompat.get(filters + .flatMap(ParquetFilters.createFilter(schema, _)) + .reduce(FilterApi.and)) +val filteredMetadata = + RowGroupFilter.filterRowGroups(filter, metadataBlocks, parquetSchema).asScala +filteredMetadata.flatMap(bmd => { + val bmdPath = new Path(metadataRoot, bmd.getPath) + val fsPath = fileStatus.getPath + if (bmdPath == fsPath) { +Some(new FileSplit(bmdPath, bmd.getStartingPos, bmd.getTotalByteSize, Array.empty)) + } else { +None + } +}) + } + + private def getMetadataForPath(filePath: Path, + rootPath: Path, + conf: Configuration): Option[ParquetMetadata] = { +val fs = rootPath.getFileSystem(conf) +try { + val stat = fs.getFileStatus(rootPath) + // Mimic Parquet behavior. If given a directory, find the underlying _metadata file + // If given a single file, check the parent directory for a _metadata file + val directory = if (stat.isDirectory) stat.getPath else stat.getPath.getParent + val metadataFile = new Path(directory, ParquetFileWriter.PARQUET_METADATA_FILE) + val metadata = +ParquetFileReader.readFooter(conf, metadataFile, ParquetMetadataConverter.NO_FILTER) + + // Ensure that the metadata has an entry for the file. + // If it does not, do not filter at this stage. + val matchingBlockExists = metadata.getBlocks.asScala.exists(bmd => { --- End diff -- I'm a little unclear as to why this branch is necessary. Under what conditions do you have a parquet dataset where one of its files isn't in the summary metadata? Is this the case where you write out the dataset with metadata, then add a new partition, and write without updating the metadata? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14649 Yep yep, closing this one --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14649: [SPARK-17059][SQL] Allow FileFormat to specify pa...
Github user andreweduffy closed the pull request at: https://github.com/apache/spark/pull/14649 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14649 Hey @paulata, sounds like you want to make the pruner pluggable at runtime. That sounds like it would be good, though IMO is a bit orthogonal to this change. This is more of an optimization than an FR, i.e. you're "no worse off" with this change performance-wise and all pre-existing behavior is preserved, whereas yours is a new feature. You're welcome to reuse the code in this PR for your own needs, or to base your changes on top of this. I have my doubts that the PR in its current form will be merged into Spark, but it's already been merged into Palantir's fork (https://github.com/palantir/spark/pull/40) so hopefully the folks there will try and upstream it at some point in the future. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 In light of @HyukjinKwon's benchmark it seems like Spark-side filtering is the right thing to do here, so I think this should be good? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14649 Glad that helped, sorry if it wasn't more clear. Agreed that writing summary metadata isn't always the best. In this patch, it only ever performs the file pruning if the _metadata file exists for the dataset. At work we have it enabled since we have a query-heavy workload where new data lands occasionally. Sent from Outlook On Tue, Sep 27, 2016 at 10:13 AM -0700, "Cheng Lian" <notificati...@github.com> wrote: @andreweduffy @andreweduffy Thanks for the explanations! This makes much more sense to me now. Although _metadata can be neat for the read path, it's a trouble maker for the write path: Writing summary files (either _metadata or _common_metadata) can be quite expensive when writing a large Parquet dataset since it reads footers from all files and tries to merge them. This can be especially frustrating when appending a small amount of data to an existing large dataset. Parquet doesn't always write the summary files even if you explicitly set parquet.enable.summary-metadata to true. For example, when two files have different values of a single key in the user-defined key/value metadata section, Parquet simply gives up writing the summary files and delete existing ones. This may be quite common in the case of schema evolution. What makes it worse, outdated _common_metadata might not be deleted properly due to PARQUET-359, which makes the summary files out of sync. However, I still agree that with an existing trustworthy _metadata file at hand, this patch is still very useful. I'll take a deeper look at this tomorrow. â You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14649 Hyun mostly sums it up. This uses the summary metadata for Parquet when available. Rather than performing row group level filtering, it actually filters out entire files when summary metadata is available. It does this when it's constructing the FileScanRDD, which means it actually only spawns tasks for files that match the predicate. At work we were running into issues with S3 deployments where very large S3 datasets would take exceedingly long to load in Spark. Empirically, we're running this exact patch in production and for many types of queries, we see a very large decrease in tasks created and time spent fetching from S3. So this is mainly for the use case of short-lived RDDs (so doing .persist doesn't help you) that are backed by data in S3 (so eliminating read time is actually a significant speed up) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 @davies Row-level filtering doesn't occur with the vectorized reader, which is now enabled by default --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 cool, ping to @davies @cloud-fan would either of you be able to look at this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14649 addressed typos --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 @HyukjinKwon would you like to file a separate ticket for benchmarking? It's pretty orthogonal to this PR, see rdblue's comment above. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 cc @davies @cloud-fan for parquet change, seems I got @rdblue's stamp of approval --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14649 cc @davies @cloud-fan as well --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 @rxin Did you get the chance to take a closer look at this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 Yeah benchmarking is definitely a great idea, as it is likely Spark will be better than Parquet at filtering individual records, but I'm still not quite understanding why this filter is any different and should block on row-by-row filtering decision. _All_ filters are being processed row-by-row using ParquetRecordReader to my understanding, and this one is no different from any of the others in ParquetFilters. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 That is true, but currently all filters are being pushed down to row-by-row anyway when not using the vectorized reader, so I'm unclear why the IN filter is special --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to O...
Github user andreweduffy commented on a diff in the pull request: https://github.com/apache/spark/pull/14671#discussion_r75092317 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -369,7 +369,7 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex test("SPARK-11103: Filter applied on merged Parquet schema with new column fails") { import testImplicits._ -Seq("true", "false").map { vectorized => +Seq("true", "false").foreach { vectorized => --- End diff -- Yep, maybe this belongs in a separate PR as a followup to the previous one, I just changed it because Intellij was yelling at me :P --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 Thanks for the comments guys! Had to search through some code, but I think I understand the current state of things. Correct me if I'm wrong, but it seems that record-by-record filtering only occurs if the vectorized reader is disabled, as there is no logic in SpecificParquetRecordReaderBase to perform individual record filtering, so currently this will only be happening if we fallback to the Parquet-provided ParquetRecordReader. After doing some digging into ParquetRecordReader, it pushes down the "record-by-record" filter to InternalParquetRecordReader. However, it appears that inside of the InternalParquetRecordReader, you can actually disable row-by-row filtering, with the magic conf `parquet.filter.record-level.enabled` ([declaration](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L121) and [usage](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L180)) So in theory if we set this configuration to false when we construct ParquetRecordReader in [ParquetFileFormat](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L381), we wouldn't have to worry about row-by-row filtering being applied along either code path (I think). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 cc @HyukjinKwon @rdblue for Parquet-related change --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14671: SPARK-17091: ParquetFilters rewrite IN to OR of E...
Github user andreweduffy commented on a diff in the pull request: https://github.com/apache/spark/pull/14671#discussion_r75008463 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -40,20 +40,20 @@ import org.apache.spark.util.{AccumulatorContext, LongAccumulator} * NOTE: * * 1. `!(a cmp b)` is always transformed to its negated form `a cmp' b` by the - *`BooleanSimplification` optimization rule whenever possible. As a result, predicate `!(a < 1)` - *results in a `GtEq` filter predicate rather than a `Not`. + *`BooleanSimplification` optimization rule whenever possible. As a result, predicate + *`!(a < 1)` results in a `GtEq` filter predicate rather than a `Not`. * * 2. `Tuple1(Option(x))` is used together with `AnyVal` types like `Int` to ensure the inferred *data type is nullable. */ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext { private def checkFilterPredicate( - df: DataFrame, - predicate: Predicate, - filterClass: Class[_ <: FilterPredicate], - checker: (DataFrame, Seq[Row]) => Unit, - expected: Seq[Row]): Unit = { +df: DataFrame, --- End diff -- Sorry about the indentation fixes, I can revert all of these. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14671: SPARK-17091: ParquetFilters rewrite IN to OR of E...
GitHub user andreweduffy opened a pull request: https://github.com/apache/spark/pull/14671 SPARK-17091: ParquetFilters rewrite IN to OR of Eq ## What changes were proposed in this pull request? Allow for pushdown of `IN` clauses. Previous implementations relied upon custom user defined predicates in Parquet, instead here we just convert an IN over a set to an OR over a set of equality expressions, which can be pushed down properly to Parquet. ## How was this patch tested? Unit tests from previous PR's, specifically #10278. They pass with the change and fail when the case block is commented out, indicating the pushdown is successfully being applied in Parquet. Because it is a disjunction of equality checks this should be applied at the row group level. You can merge this pull request into a Git repository by running: $ git pull https://github.com/andreweduffy/spark pushdown-in Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14671.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14671 commit 7679285bceedb41c7d6390c069f0a852804c8cf3 Author: Andrew Duffy <r...@aduffy.org> Date: 2016-08-16T17:57:15Z SPARK-17091: ParquetFilters rewrite IN to OR of Eq --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14649 (Thanks Sean for whitelisting!) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14649 cc @rxin @liancheng Let me know what you think --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14649 Can we rerun tests please? Issue should be fixed now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14649 Yep, that's it. Effectively it just reduces task overhead for querying large parquet datasets and turns filter queries where the filter never applies into a no-op, no tasks generated (only if summary metadata is available) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14649: [SPARK-17059][SQL] Allow FileFormat to specify partition...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14649 Let me see if I can answer these in order: > So, if my understanding is correct, Parquet filters rowgroups in both normal reader and vectorized reader already (#13701). Is this doing the same thing in Spark-side? Yep, this attempts to do Footer processing before launching tasks, if the _metadata file is available. > Also, doesn't this try to touch many files in driver-side? This should only be touching the _metadata file, it reads all the Footers that are collected in there to determine file paths to prune out. It needs the _metadata and not the _common_metadata because _common_metadata doesn't include footer info. > Also, if my understanding is correct, we are picking up only single file to read footer (see ParquetFileFormat.scala#L217-L225) unless we merge schemas. So, it seems, due to this reason, writing _metadata or _common_metadata is disabled (See https://issues.apache.org/jira/browse/SPARK-15719). Yes we have generation of the summary-metadata disabled by default, but it is configurable via `parquet.enable.summary-metadata` conf parameter. If the _metadata file isn't found it simply falls back to the original partitioning structure and does no file pruning. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14649: [SPARK-17059][SQL] Allow FileFormat to specify pa...
Github user andreweduffy commented on a diff in the pull request: https://github.com/apache/spark/pull/14649#discussion_r74860656 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala --- @@ -703,6 +703,16 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext } } } + + test("SPARK-17059: Allow Allow FileFormat to specify partition pruning strategy") { --- End diff -- Typo will fix --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14649: [SPARK-17059][SQL] Allow FileFormat to specify pa...
GitHub user andreweduffy opened a pull request: https://github.com/apache/spark/pull/14649 [SPARK-17059][SQL] Allow FileFormat to specify partition pruning strategy ## What changes were proposed in this pull request? Different FileFormat implementations may be able to make intelligent decisions about files that will need to be processed as part of a FileSourceScanExec based on the `pushedFilters` available. This PR implements a way to do that pluggably, with an implementation for Parquet that allows applying the summary metadata to prevent task creation. This has a few performance benefits: 1. Reading of files is generally slow, especially for S3. In the current Parquet implementation the summary metadata is not used and so the footers are read directly. This can be very slow for large Parquet datasets, as even as of Hadoop 2.7 S3 reads will read the entire file by default (random reads are configurable only starting on 2.7 onwards, and is disabled by default) 2. Partitions that are found to contain no files can then be pruned out or coalesced depending on the FileFormat's implementation, allowing for fewer tasks being created. ## How was this patch tested? Existing tests and Spark Shell, plus unit test for the Parquet implementation. You can merge this pull request into a Git repository by running: $ git pull https://github.com/andreweduffy/spark parquet-task-pruning Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14649.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14649 commit 46657980c00ce36c03e39540bd4399613f439e04 Author: Andrew Duffy <r...@aduffy.org> Date: 2016-07-27T16:47:20Z Allow FileFormat to specify file pruning --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14159: [SQL][PARQUET] Fix for Vectorized Parquet filter pushdow...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14159 Yep, closing now --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14159: [SQL][PARQUET] Fix for Vectorized Parquet filter ...
Github user andreweduffy closed the pull request at: https://github.com/apache/spark/pull/14159 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14159: [SQL][PARQUET] Fix for Vectorized Parquet filter pushdow...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14159 Actually it appears that since this was opened, later PR #14450 fixes this. Should be safe to close now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14201: [SPARK-14702] Make environment of SparkLauncher launched...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14201 Thanks @vanzin ! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14201: [SPARK-14702] Make environment of SparkLauncher launched...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14201 Yuck, sorry, switching between projects my style settings were being carried over, that should be it for indentation fixes --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14159: [SQL][PARQUET] Fix for Vectorized Parquet filter pushdow...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14159 @hvanhovell Anything new on this front? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13953: [SPARK-16265][SUBMIT] Addition of --driver-jre Fl...
Github user andreweduffy closed the pull request at: https://github.com/apache/spark/pull/13953 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14201: [SPARK-14702] Make environment of SparkLauncher launched...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14201 Gotcha, sorry about the force overwrites, in some projects I've worked on that's preferred to keep clean history, though for review I can see why separate units is easier. Thanks again for taking the time to review this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14201: [SPARK-14702] Make environment of SparkLauncher l...
Github user andreweduffy commented on a diff in the pull request: https://github.com/apache/spark/pull/14201#discussion_r71036075 --- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java --- @@ -359,6 +369,83 @@ public SparkLauncher setVerbose(boolean verbose) { } /** + * Sets the working directory of spark-submit. + * + * @param dir The directory to set as spark-submit's working directory. + * @return This launcher. + */ + public SparkLauncher directory(File dir) { +workingDir = dir; +return this; + } + + /** + * Specifies that stderr in spark-submit should be redirected to stdout. + * + * @return This launcher. + */ + public SparkLauncher redirectError() { +redirectErrorStream = true; +return this; + } + + /** + * Redirects error output to the specified Redirect. + * + * @param to The method of redirection. + * @return This launcher. + */ + public SparkLauncher redirectError(ProcessBuilder.Redirect to) { +errorStream = to; +return this; + } + + /** + * Redirects standard output to the specified Redirect. + * + * @param to The method of redirection. + * @return This launcher. + */ + public SparkLauncher redirectOutput(ProcessBuilder.Redirect to) { +outputStream = to; +return this; + } + + /** + * Redirects error output to the specified File. + * + * @param errFile The file to which stderr is written. + * @return This launcher. + */ + public SparkLauncher redirectError(File errFile) { +errorStream = ProcessBuilder.Redirect.to(errFile); +return this; + } + + /** + * Redirects error output to the specified File. + * + * @param outFile The file to which stdout is written. + * @return This launcher. + */ + public SparkLauncher redirectOutput(File outFile) { +outputStream = ProcessBuilder.Redirect.to(outFile); +return this; + } + + /** + * Sets all output to be logged and redirected to a logger with the specified name. + * + * @param loggerName The name of the logger to log stdout and stderr. + * @return This launcher. + */ + public SparkLauncher redirectToLog(String loggerName) { --- End diff -- Yep, that's what I'm saying with the second point above, so those two should probably be good enough --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14201: [SPARK-14702] Make environment of SparkLauncher l...
Github user andreweduffy commented on a diff in the pull request: https://github.com/apache/spark/pull/14201#discussion_r71031747 --- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java --- @@ -359,6 +369,83 @@ public SparkLauncher setVerbose(boolean verbose) { } /** + * Sets the working directory of spark-submit. + * + * @param dir The directory to set as spark-submit's working directory. + * @return This launcher. + */ + public SparkLauncher directory(File dir) { +workingDir = dir; +return this; + } + + /** + * Specifies that stderr in spark-submit should be redirected to stdout. + * + * @return This launcher. + */ + public SparkLauncher redirectError() { +redirectErrorStream = true; +return this; + } + + /** + * Redirects error output to the specified Redirect. + * + * @param to The method of redirection. + * @return This launcher. + */ + public SparkLauncher redirectError(ProcessBuilder.Redirect to) { +errorStream = to; +return this; + } + + /** + * Redirects standard output to the specified Redirect. + * + * @param to The method of redirection. + * @return This launcher. + */ + public SparkLauncher redirectOutput(ProcessBuilder.Redirect to) { +outputStream = to; +return this; + } + + /** + * Redirects error output to the specified File. + * + * @param errFile The file to which stderr is written. + * @return This launcher. + */ + public SparkLauncher redirectError(File errFile) { +errorStream = ProcessBuilder.Redirect.to(errFile); +return this; + } + + /** + * Redirects error output to the specified File. + * + * @param outFile The file to which stdout is written. + * @return This launcher. + */ + public SparkLauncher redirectOutput(File outFile) { +outputStream = ProcessBuilder.Redirect.to(outFile); +return this; + } + + /** + * Sets all output to be logged and redirected to a logger with the specified name. + * + * @param loggerName The name of the logger to log stdout and stderr. + * @return This launcher. + */ + public SparkLauncher redirectToLog(String loggerName) { --- End diff -- So I had an extra check in the last commit, thought there was a comment that it should be removed though, maybe I should just clarify here. I see 2 possible checks that we should be doing here: * Ensure that only one of `redirectError(...)` and `redirectError()` can be used (done) * Ensure that `redirectToLog()` is never used with any other `redirect*` method (removed in this diff, can be added back) Are there others that I'm missing? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14159: [SQL][PARQUET] Fix for Vectorized Parquet filter pushdow...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14159 bump? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14201: [SPARK-14702] Make environment of SparkLauncher l...
Github user andreweduffy commented on a diff in the pull request: https://github.com/apache/spark/pull/14201#discussion_r70905418 --- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java --- @@ -359,6 +364,82 @@ public SparkLauncher setVerbose(boolean verbose) { } /** + * Sets the working directory of the driver process. + * @param dir The directory to set as the driver's working directory. + * @return This launcher. + */ + public SparkLauncher directory(File dir) { +builder.workingDir = dir; +return this; + } + + /** + * Specifies that stderr in the driver should be redirected to stdout. + * @return This launcher. + */ + public SparkLauncher redirectError() { +builder.redirectErrorStream = true; +return this; + } + + /** + * Redirects error output to the specified Redirect. + * @param to The method of redirection. + * @return This launcher. + */ + public SparkLauncher redirectError(ProcessBuilder.Redirect to) { +builder.errorStream = to; +return this; + } + + /** + * Redirects standard output to the specified Redirect. + * @param to The method of redirection. + * @return This launcher. + */ + public SparkLauncher redirectOutput(ProcessBuilder.Redirect to) { +builder.outputStream = to; +return this; + } + + /** + * Redirects error output to the specified File. + * @param errFile The file to which stderr is written. + * @return This launcher. + */ + public SparkLauncher redirectError(File errFile) { +builder.errorStream = ProcessBuilder.Redirect.to(errFile); +return this; + } + + /** + * Redirects error output to the specified File. + * @param outFile The file to which stdout is written. + * @return This launcher. + */ + public SparkLauncher redirectOutput(File outFile) { +builder.outputStream = ProcessBuilder.Redirect.to(outFile); +return this; + } + + /** + * Sets all output to be logged and redirected to a logger with the specified name. + * @param loggerName The name of the logger to log stdout and stderr. + * @return This launcher. + */ + public SparkLauncher redirectToLog(String loggerName) { +try { + // NOTE: the below ordering is important, so builder.redirectToLog is only set to true iff + // the preceding put() finishes without exception. + builder.getEffectiveConfig().put(CHILD_PROCESS_LOGGER_NAME, loggerName); --- End diff -- Should I also modify `startApplication` to read from builder.conf? It appears to use `builder.getEffectiveConfig()` which as far as I can tell is sourced from a properties file --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14201: [SPARK-14702] Make environment of SparkLauncher l...
Github user andreweduffy commented on a diff in the pull request: https://github.com/apache/spark/pull/14201#discussion_r70904921 --- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java --- @@ -82,8 +83,12 @@ /** Used internally to create unique logger names. */ private static final AtomicInteger COUNTER = new AtomicInteger(); + public static final ThreadFactory REDIRECTOR_FACTORY = new NamedThreadFactory("launcher-proc-%d"); --- End diff -- How should this be shared do you think without making it public static in either SparkLauncher or ChildProcApphandle? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14201: [SPARK-14702] Make environment of SparkLauncher l...
Github user andreweduffy commented on a diff in the pull request: https://github.com/apache/spark/pull/14201#discussion_r70888965 --- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java --- @@ -82,8 +83,12 @@ /** Used internally to create unique logger names. */ private static final AtomicInteger COUNTER = new AtomicInteger(); + public static final ThreadFactory REDIRECTOR_FACTORY = new NamedThreadFactory("launcher-proc-%d"); + static final Map<String, String> launcherConfig = new HashMap<>(); + private OutputRedirector redirector; // Holds the possible OutputRedirector --- End diff -- alright that was a concern of mine, will remove it in next patch --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14201: [SPARK-14702] Expose SparkLauncher's ProcessBuilder for ...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14201 I'll go ahead and replace use of Throwables with throw new RTE, not clear why it can't find Guava on the classpath however --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14185: [SPARK-14702][SUBMIT] Expose SparkLauncher's Proc...
Github user andreweduffy closed the pull request at: https://github.com/apache/spark/pull/14185 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14201: [SPARK-14702] Expose SparkLauncher's ProcessBuilder for ...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14201 Replaces #14185, @vanzin --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14201: [SPARK-14702] Expose SparkLauncher's ProcessBuild...
GitHub user andreweduffy opened a pull request: https://github.com/apache/spark/pull/14201 [SPARK-14702] Expose SparkLauncher's ProcessBuilder for user flexibility ## What changes were proposed in this pull request? Adds a few public methods to `SparkLauncher` to allow configuring some extra features of the `ProcessBuilder`, including the working directory, output and error stream redirection. ## How was this patch tested? Unit testing + simple Spark driver programs You can merge this pull request into a Git repository by running: $ git pull https://github.com/andreweduffy/spark feature/launcher Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14201.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14201 commit 5a35963bc4661193d8c169c7f98fffe24b181124 Author: Andrew Duffy <adu...@palantir.com> Date: 2016-07-14T11:57:15Z Make environment of SparkLauncher launched process more configurable --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14185: [SPARK-14702][SUBMIT] Expose SparkLauncher's ProcessBuil...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14185 That sounds reasonable, I can take a shot at that today, I'll submit it as a second PR and then we can close this one out. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14185: [SPARK-16511][SUBMIT] Expose SparkLauncher's Proc...
GitHub user andreweduffy opened a pull request: https://github.com/apache/spark/pull/14185 [SPARK-16511][SUBMIT] Expose SparkLauncher's ProcessBuilder for user ## What changes were proposed in this pull request? Expose the `ProcessBuilder` in `SparkLauncher`, as well as opening up `SparkLauncher#startApplication` to allow accepting arbitrary ProcessBuilders. For more info see [https://issues.apache.org/jira/browse/SPARK-16511](this JIRA ticket) Also covers [https://issues.apache.org/jira/browse/SPARK-14702](SPARK-14702) ## How was this patch tested? Unit tests + spark shell You can merge this pull request into a Git repository by running: $ git pull https://github.com/andreweduffy/spark feature/processbuilder Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14185.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14185 commit 7fe36f5970e7e577a47d8b6a7534cc95d22a94c2 Author: Andrew Duffy <adu...@palantir.com> Date: 2016-07-13T14:20:54Z [SPARK-16511][SUBMIT] Expose SparkLauncher's ProcessBuilder for user flexibility Also covers https://issues.apache.org/jira/browse/SPARK-16511 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14159: [SQL][PARQUET] Fix for Vectorized Parquet filter pushdow...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14159 Yep, looks like the other one was closed by the committer. I saw Sean commented that this might need to be tested against 2.2, is that going to be necessary? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14159: [PARQUET] Fix for Parquet filter pushdown
GitHub user andreweduffy opened a pull request: https://github.com/apache/spark/pull/14159 [PARQUET] Fix for Parquet filter pushdown ## What changes were proposed in this pull request? Fix parquet filter pushdown from not reaching all the way down to the file level Use of previous deprecated constructor defaults to null metadata, which prevents pushdown from reaching the Parquet level. ## How was this patch tested? Looking at output of collects from SparkShell, before were printing warnings about CorruptStatistics, preventing pushing down filters to individual parquet files. Now able to use the metadata in each file to pushdown. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) You can merge this pull request into a Git repository by running: $ git pull https://github.com/andreweduffy/spark bugfix/pushdown Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14159.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14159 commit f825ad709cdc3c89d0cc7e41d0410998e6cc7541 Author: Andrew Duffy <adu...@palantir.com> Date: 2016-07-12T19:41:22Z Fix for Parquet filter pushdown Use of previous deprecated constructor defaults to null metadata, which prevents pushdown from reaching the Parquet level. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13953: [SPARK-16265][SUBMIT] Addition of --driver-jre Flag to S...
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/13953 @petermaxlee can you give an example of how it can "get into a lot of problems"? The use case seems weird if you're usually developing on hardware that you control, but makes more sense if you're developing a Spark application and then want to give it to someone to run on their own cluster. It integrates with `SparkLauncher` so that you can spawn cluster jobs that use the correct version of Java without forcing the recipient of the app to futz with their cluster. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13953: [SPARK-16265][SUBMIT] Addition of --driver-jre Fl...
GitHub user andreweduffy opened a pull request: https://github.com/apache/spark/pull/13953 [SPARK-16265][SUBMIT] Addition of --driver-jre Flag to SparkSubmit ## What changes were proposed in this pull request? 1. Addition of `--driver-jre` flag to `SparkSubmitArguments` 2. Now when `--driver-jre` is specified to `SparkSubmit` and it is connecting to a YARN master, the driver will zip up a copy of its in-use JRE and ship it to the YARN cluster as part of `spark.yarn.dist.archives`. Then by setting `spark.executorEnv.JAVA_HOME` and `spark.appMasterEnv.JAVA_HOME` to point to their local copies, use of specific versions of Java can be had even on YARN clusters that do not have those versions pre-installed. ## How was this patch tested? Compiled the Spark examples with Java 8, did a Java 8 `spark-submit --driver-jre` of `SparkPi` to a Java 7 YARN cluster, executors and AM used Java 8 without issue. Running without shipping the JRE results in errors due to Java version differences. You can merge this pull request into a Git repository by running: $ git pull https://github.com/andreweduffy/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13953.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13953 commit abd80d1bacd66eea6dfeb33fde3caa34bd7c9c7d Author: Andrew Duffy <r...@aduffy.org> Date: 2016-06-28T10:44:46Z Addition of --driver-jre Flag to SparkSubmit --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org