[GitHub] spark pull request #19184: [SPARK-21971][CORE] Too many open files in Spark ...
Github user rajeshbalamohan closed the pull request at: https://github.com/apache/spark/pull/19184 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/19184 Thanks @mridulm , @jerryshao , @viirya . Closing this PR.
[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/19184 Thanks @viirya . I have updated the patch to address your comments. This fixes the "too many open files" issue for queries involving window functions (e.g. Q67, Q72, Q14), but the issue still needs to be addressed for the spill merger. Agreed that this would be a partial patch.
[GitHub] spark pull request #19184: [SPARK-21971][CORE] Too many open files in Spark ...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/19184#discussion_r137973976
--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -104,6 +124,10 @@ public void loadNext() throws IOException {
     if (taskContext != null) {
       taskContext.killTaskIfInterrupted();
     }
+    if (this.din == null) {
+      // Good time to init (if all files are opened, we can get Too Many files exception)
+      initStreams();
+    }
--- End diff --
Good point. The PR has been tried with queries involving window functions (e.g. Q67), for which it worked fine. During spill merges (esp. getSortedIterator), it is still possible to encounter the too many open files issue.
[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/19184 I got into this with a limit of 32K. "unlimited" is another option which can serve as a workaround, but that may not be preferable in production systems. E.g., with Q67 I observed 9000+ spill files in a task, and with multiple tasks per executor it easily ended up reaching the limit.
[GitHub] spark pull request #19184: [SPARK-21971][CORE] Too many open files in Spark ...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/19184 [SPARK-21971][CORE] Too many open files in Spark due to concurrent files being opened
## What changes were proposed in this pull request?
In UnsafeExternalSorter::getIterator(), a file is opened in UnsafeSorterSpillReader for every spillWriter, and these files are closed at a later point in time as part of the close() call. However, when a large number of spill files are present, the number of open files grows to a great extent and ends up throwing a "Too many open files" exception. This can easily be reproduced with TPC-DS Q67 at 1 TB scale on a multi-node cluster with multiple cores per executor. There are ways to reduce the number of spill files generated in Q67, e.g. increasing "spark.sql.windowExec.buffer.spill.threshold" (4096 is the default). Another option is to increase the ulimit to much higher values. But those are workarounds. This PR reduces the number of files that are kept open at a time in UnsafeSorterSpillReader.
## How was this patch tested?
Manual testing of Q67 at 1 TB and 10 TB scale on a multi-node cluster.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-21971 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19184.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19184 commit dcc2960d5f60add9bfd9446df59b0d0d06365947 Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2017-09-11T01:36:12Z [SPARK-21971][CORE] Too many open files in Spark due to concurrent files being opened
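The lazy-open idea behind this PR can be sketched as follows. This is a hypothetical illustration, not the actual UnsafeSorterSpillReader code: the reader records the spill file path eagerly, but only opens the underlying stream on the first read, so the number of simultaneously open spill files stays small.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of lazily opening a spill file: constructing the reader
// consumes no file descriptor; the stream is opened on first use.
class LazySpillReader {
    private final String file;   // spill file path, recorded eagerly (no fd used)
    private DataInputStream din; // opened lazily on first read

    LazySpillReader(String file) {
        this.file = file;
    }

    private void initStreams() {
        try {
            din = new DataInputStream(new BufferedInputStream(new FileInputStream(file)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int loadNext() {
        if (din == null) {
            // Opening all spill files eagerly is what can trip the
            // "Too many open files" ulimit with thousands of spills per task.
            initStreams();
        }
        try {
            return din.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void close() {
        if (din != null) {
            try {
                din.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
```

As the thread notes, deferring the open only helps when readers are consumed one at a time (e.g. window functions); a k-way spill merge still needs k streams open simultaneously, which is why this remained a partial fix.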
[GitHub] spark issue #14537: [SPARK-16948][SQL] Use metastore schema instead of infer...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 @cloud-fan . The failure is related to the parquet changes introduced for returning metastoreSchema (it has issues with complex types). I am not very comfortable with the Parquet codepath. For the time being, I will revert the last change. We can create a subsequent JIRA if needed for Parquet-related changes; alternatively, I am fine with someone who is comfortable with the Parquet code taking this over. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark pull request #14537: [SPARK-16948][SQL] Use metastore schema instead o...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/14537#discussion_r79972251
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -237,21 +237,27 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
       new Path(metastoreRelation.catalogTable.storage.locationUri.get), partitionSpec)
-    val inferredSchema = if (fileType.equals("parquet")) {
-      val inferredSchema =
-        defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles())
-      inferredSchema.map { inferred =>
-        ParquetFileFormat.mergeMetastoreParquetSchema(metastoreSchema, inferred)
-      }.getOrElse(metastoreSchema)
-    } else {
-      defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles()).get
+    val schema = fileType match {
+      case "parquet" =>
+        val inferredSchema =
+          defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles())
+
+        // For Parquet, get correct schema by merging Metastore schema data types
--- End diff --
Sure. Will change to return metastoreSchema for Parquet as well.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Use metastore schema instead of infer...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 @cloud-fan >> For branch 2.0, we should open another PR to fix the OrcFileFormat.inferSchema, to not throw FileNotFoundException for empty table. The code for not throwing FileNotFoundException in OrcFileFormat.inferSchema was removed from this patch. I can create a separate JIRA for that; please let me know if that is blocking this patch.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Use metastore schema instead of infer...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 Sorry about the delay. Updated the PR.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 Thanks @gatorsmile . Removed the changes related to OrcFileFormat
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 Fixed the test case name. I haven't changed the parquet code path, as I wasn't sure whether it would break any backward compatibility.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 Thanks @gatorsmile, it would be good to retain the change in OrcFileFormat's inferSchema (just in case it is referenced later).
[GitHub] spark pull request #14537: [SPARK-16948][SQL] Querying empty partitioned orc...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/14537#discussion_r76179877
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -54,10 +57,12 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
     sparkSession: SparkSession,
     options: Map[String, String],
     files: Seq[FileStatus]): Option[StructType] = {
-    OrcFileOperator.readSchema(
-      files.map(_.getPath.toUri.toString),
-      Some(sparkSession.sessionState.newHadoopConf())
-    )
+    // Safe to ignore FileNotFoundException in case no files are found.
+    val schema = Try(OrcFileOperator.readSchema(
--- End diff --
Yes, in case this is referred to at any later time.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 OK, reverted the physical schema changes. In both cases it returns the metastore schema, and mismatches can be handled separately.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 For non-partitioned ORC, HiveMetastoreCatalog currently uses the metastore schema and does not infer the schema, hence it is not an issue there. But the problem of wrong mapping (i.e. the physical column name in the file being different from that in the metastore) still exists. The more I look at it, it would be easier to club the patches into this JIRA itself. If so, HiveMetastoreCatalog could just rely on the metastoreSchema, and OrcFileFormat can later do the mapping if the mappings are different. I will revise the patch to include this scenario and post it.
[GitHub] spark pull request #14537: [SPARK-16948][SQL] Querying empty partitioned orc...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/14537#discussion_r75967137
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -237,21 +237,26 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
       new Path(metastoreRelation.catalogTable.storage.locationUri.get), partitionSpec)
-    val inferredSchema = if (fileType.equals("parquet")) {
-      val inferredSchema =
-        defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles())
-      inferredSchema.map { inferred =>
-        ParquetFileFormat.mergeMetastoreParquetSchema(metastoreSchema, inferred)
-      }.getOrElse(metastoreSchema)
-    } else {
-      defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles()).get
+    val inferredSchema =
+      defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles())
+    val schema = fileType match {
+      case "parquet" =>
+        // For Parquet, get correct schema by merging Metastore schema data types
+        // and Parquet schema field names.
+        inferredSchema.map { schema =>
+          ParquetFileFormat.mergeMetastoreParquetSchema(metastoreSchema, schema)
+        }.getOrElse(metastoreSchema)
+      case "orc" =>
+        inferredSchema.getOrElse(metastoreSchema)
+      case _ =>
+        inferredSchema.get
--- End diff --
Thanks @mallman . Addressed this in the latest commit.
[GitHub] spark pull request #14537: [SPARK-16948][SQL] Querying empty partitioned orc...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/14537#discussion_r75902767
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -237,21 +237,26 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
       new Path(metastoreRelation.catalogTable.storage.locationUri.get), partitionSpec)
-    val inferredSchema = if (fileType.equals("parquet")) {
-      val inferredSchema =
-        defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles())
-      inferredSchema.map { inferred =>
-        ParquetFileFormat.mergeMetastoreParquetSchema(metastoreSchema, inferred)
-      }.getOrElse(metastoreSchema)
-    } else {
-      defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles()).get
+    val inferredSchema =
+      defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles())
+    val schema = fileType match {
+      case "parquet" =>
+        // For Parquet, get correct schema by merging Metastore schema data types
+        // and Parquet schema field names.
+        inferredSchema.map { schema =>
+          ParquetFileFormat.mergeMetastoreParquetSchema(metastoreSchema, schema)
+        }.getOrElse(metastoreSchema)
+      case "orc" =>
+        inferredSchema.getOrElse(metastoreSchema)
+      case _ =>
+        inferredSchema.get
--- End diff --
Not sure if an exception has to be thrown in this case, or whether to just return null?
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 Thanks @gatorsmile. Addressed review comments
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 For latest ORC, if the data was written out by Hive, it would have the same mapping.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 Right, for Parquet this could be part of the initial codebase (from SPARK-1251, I believe) which merges any metastore conflicts with parquet files. But in the case of ORC, this inference is still valid, as the column names stored in the old ORC format could be different from those in the Hive Metastore (e.g. HIVE-4243). There is a separate PR, https://github.com/apache/spark/pull/14471, which tracks the ORC compatibility issue.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 Thanks @rxin . Incorporated review comments.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 @rxin Can you please review when you find time?
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 Thank you thejas and @mallman
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 @tejasapatil, @mallman - Can you please review when you find time?
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 Thanks @mallman . Fixed review comments in latest commit.
[GitHub] spark issue #10846: [SPARK-12920][SQL] Fix high CPU usage in spark thrift se...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/10846 They take longer to clean up. If queries are executed continuously, a major portion of the thrift server's time is wasted in GC. In any case, I have removed the HadoopRDD change in the recent commit; it can be tracked in a separate JIRA.
[GitHub] spark issue #10846: [SPARK-12920][SQL] Fix high CPU usage in spark thrift se...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/10846 SoftRef causes lots of memory pressure on the thrift server. To be precise, when executing a query with a large dataset, it can very soon run at 1200% CPU with all threads carrying out GC activity. That is from the HadoopRDD conf caching: due to softRef, entries survive until the GC threshold is reached and only then get cleared. It does not OOM, but runs at very high CPU due to GC. JobProgress* does not clean up the data fast enough in some cases (e.g. when too many queries are executed continuously), and in such cases the memory pressure on the thrift server increases. Both of them contribute to the high CPU usage. I am afraid that fixing only one of them would still leave the high-CPU issue.
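The SoftReference behavior described above can be illustrated with a minimal sketch (the class name and structure are hypothetical, not Spark's actual conf cache): softly referenced values are reclaimed only when the JVM is close to exhausting the heap, so under sustained load such a cache grows until the collector is already under pressure.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a SoftReference-backed cache like the HadoopRDD
// conf caching discussed in this thread. Values stay reachable until the
// GC is near the heap limit, which is the GC-churn behavior described above.
class SoftRefCache<K, V> {
    private final Map<K, SoftReference<V>> cache = new HashMap<>();

    void put(K key, V value) {
        cache.put(key, new SoftReference<>(value));
    }

    V get(K key) {
        SoftReference<V> ref = cache.get(key);
        // Returns null either when the key was never cached or when the
        // collector has already cleared the soft reference under pressure.
        return ref == null ? null : ref.get();
    }
}
```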
[GitHub] spark pull request #14537: [SPARK-16948][SQL] Querying empty partitioned orc...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/14537 [SPARK-16948][SQL] Querying empty partitioned orc tables throws exception
## What changes were proposed in this pull request?
Querying empty partitioned ORC tables from spark-sql throws an exception with `spark.sql.hive.convertMetastoreOrc=true`. This is due to the fact that inferSchema() ends up throwing `FileNotFoundException` when no files are present in partitioned ORC tables. The patch attempts to fix this by falling back to metastore-based schema information.
## How was this patch tested?
Included unit tests and also tested it on a small-scale cluster.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-16948 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14537.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14537 commit 5721b88c7c816f57ef39374ac9b335d870543628 Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-08-08T08:28:23Z [SPARK-16948][SQL] Querying empty partitioned orc tables throws exception
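The fallback described in this PR can be sketched as follows. Names here are hypothetical illustrations (the real logic lives in HiveMetastoreCatalog / OrcFileFormat, in Scala): try to infer the schema from the data files, and fall back to the metastore schema when the table directory is empty and file-based inference fails with FileNotFoundException.

```java
import java.io.FileNotFoundException;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch of schema resolution with a metastore fallback.
// Schemas are represented as plain strings for illustration.
class SchemaResolver {
    static String resolve(Supplier<String> inferFromFiles, String metastoreSchema) {
        try {
            // Prefer the schema inferred from the files, if any was produced.
            return Optional.ofNullable(inferFromFiles.get()).orElse(metastoreSchema);
        } catch (RuntimeException e) {
            if (e.getCause() instanceof FileNotFoundException) {
                // Empty partitioned table: no files to infer from,
                // so fall back to the metastore-provided schema.
                return metastoreSchema;
            }
            throw e; // any other failure is still surfaced
        }
    }
}
```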
[GitHub] spark issue #10846: [SPARK-12920][SQL] Fix high CPU usage in spark thrift se...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/10846 Rebased to master and changed title.
[GitHub] spark issue #10846: SPARK-12920. [SQL]. Spark thrift server can run at very ...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/10846 Sorry about the delay. Missed this one. I haven't tested this recently, but yes, this would be a problem in master as well. Please let me know if I need to rebase this for master.
[GitHub] spark issue #14471: [SPARK-14387][SQL] Enable Hive-1.x ORC compatibility wit...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14471 Thanks @rxin. Changes: 1. Added a test case. Also added a sample ORC file (392 bytes) from Hive 1.x with format "Type: struct<_col0:int,_col1:string>". Without this PR's change in OrcFileFormat, the same test case would end up throwing "java.lang.IllegalArgumentException: Field "key" does not exist.". 2. Fixed the title of the JIRA and the PR.
[GitHub] spark issue #14471: [SPARK-14387][SQL] Exceptions thrown when querying ORC t...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14471 Fixed scalastyle issues
[GitHub] spark pull request #12293: [SPARK-14387][SQL] Exceptions thrown when queryin...
Github user rajeshbalamohan closed the pull request at: https://github.com/apache/spark/pull/12293
[GitHub] spark issue #12293: [SPARK-14387][SQL] Exceptions thrown when querying ORC t...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/12293 @yuananf Thanks for trying it out. I have rebased it and created https://github.com/apache/spark/pull/14471. Closing this one. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14471: [SPARK-14387][SQL] Exceptions thrown when queryin...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/14471 [SPARK-14387][SQL] Exceptions thrown when querying ORC tables ## What changes were proposed in this pull request? This PR improves OrcFileFormat to handle the case where the schema stored in the ORC file does not match the schema stored in the metastore. ORC data written by Hive 1.x used virtual column names (HIVE-4243). This is fixed in Hive 2.x, but for data written by Hive 1.x, Spark would throw exceptions. To mitigate this, "spark.sql.hive.convertMetastoreOrc" was disabled via SPARK-15705; however, that incurs a performance penalty, since reads then go via HiveTableScan and HadoopRDD. This PR fixes the underlying issue. Related tickets: SPARK-15705 : Change the default value of spark.sql.hive.convertMetastoreOrc to false. SPARK-15705 : Spark won't read ORC schema from metastore for partitioned tables. SPARK-16628 : OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if metastore schema does not match schema stored in ORC files. ## How was this patch tested? Manual testing by setting "spark.sql.hive.convertMetastoreOrc=true" and querying data stored via Hive 1.x in ORC format. Also ran the SQL-related unit tests.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-14387.2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14471.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14471 commit dc943a445047a21a88ab19566eab672e8921dcc1 Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-08-03T02:21:05Z [SPARK-14387][SQL] Exceptions thrown when querying ORC tables --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
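The core of the mismatch is that Hive 1.x wrote ORC files carrying virtual column names (_col0, _col1, ...) while the metastore holds the real column names. The idea of resolving that mismatch positionally can be sketched as below; this is only an illustrative, hypothetical sketch — the class and method names are not Spark's.

```java
import java.util.Arrays;
import java.util.List;

public class PositionalSchemaMatch {
    // If every name stored in the file is a virtual _colN name (Hive 1.x
    // style), apply the catalog schema by position; otherwise trust the
    // names stored in the file itself.
    static List<String> resolve(List<String> physical, List<String> catalog) {
        boolean allVirtual = physical.stream().allMatch(n -> n.matches("_col\\d+"));
        return allVirtual ? catalog : physical;
    }

    public static void main(String[] args) {
        List<String> physical = Arrays.asList("_col0", "_col1");
        List<String> catalog  = Arrays.asList("key", "value");
        System.out.println(resolve(physical, catalog)); // prints [key, value]
    }
}
```

Without a rule like this, looking up a metastore column such as "key" against the file's `_col0`/`_col1` names is exactly what produces the `Field "key" does not exist` error mentioned above.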
[GitHub] spark issue #13522: [SPARK-14321][SQL] Reduce date format cost and string-to...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/13522 Thank you. I have pushed the fixes in the recent commit. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #12105: [SPARK-14321][SQL] Reduce date format cost and st...
Github user rajeshbalamohan closed the pull request at: https://github.com/apache/spark/pull/12105 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #12105: [SPARK-14321][SQL] Reduce date format cost and string-to...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/12105 The patch went stale against the master branch and got a little messy in my local repo. I have created https://github.com/apache/spark/pull/13522, which is rebased to master. Will close this one after review. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #12105: [SPARK-14321][SQL] Reduce date format cost and st...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/12105#discussion_r65885590 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -391,21 +393,24 @@ abstract class UnixTime extends BinaryExpression with ExpectsInputTypes { case StringType if right.foldable => val sdf = classOf[SimpleDateFormat].getName val fString = if (constFormat == null) null else constFormat.toString -val formatter = ctx.freshName("formatter") if (fString == null) { s""" boolean ${ev.isNull} = true; ${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)}; """ } else { + val formatter = ctx.freshName("formatter") + ctx.addMutableState(sdf, formatter, s"""$formatter = null;""") --- End diff -- yes. Creating the formatter here did not create any issues. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13522: [SPARK-14321][SQL] Reduce date format cost and string-to...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/13522 @cloud-fan - Sorry about the delay. Rebased SPARK-14321 for master. https://github.com/apache/spark/pull/12105 had become stale and got a little messy in my local repo, so I ended up creating this PR. I will close the earlier one after review. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13522: [SPARK-14321][SQL] Reduce date format cost and st...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/13522 [SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date functions ## What changes were proposed in this pull request? Here is the generated code snippet when executing date functions. SimpleDateFormat is fairly expensive and can show up as a bottleneck when processing millions of records; it would be better to instantiate it once.
```
/* 066 */ UTF8String primitive5 = null;
/* 067 */ if (!isNull4) {
/* 068 */   try {
/* 069 */     primitive5 = UTF8String.fromString(new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(
/* 070 */       new java.util.Date(primitive7 * 1000L)));
/* 071 */   } catch (java.lang.Throwable e) {
/* 072 */     isNull4 = true;
/* 073 */   }
/* 074 */ }
```
With the modified code, here is the generated code:
```
/* 010 */ private java.text.SimpleDateFormat sdf2;
/* 011 */ private UnsafeRow result13;
/* 012 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder bufferHolder14;
/* 013 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter15;
/* 014 */ ...
...
/* 065 */ boolean isNull0 = isNull3;
/* 066 */ UTF8String primitive1 = null;
/* 067 */ if (!isNull0) {
/* 068 */   try {
/* 069 */     if (sdf2 == null) {
/* 070 */       sdf2 = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
/* 071 */     }
/* 072 */     primitive1 = UTF8String.fromString(sdf2.format(
/* 073 */       new java.util.Date(primitive4 * 1000L)));
/* 074 */   } catch (java.lang.Throwable e) {
/* 075 */     isNull0 = true;
/* 076 */   }
/* 077 */ }
```
Similarly, Calendar.getInstance was used in DateTimeUtils; it can be lazily initialized. ## How was this patch tested?
org.apache.spark.sql.catalyst.expressions.DateExpressionsSuite,org.apache.spark.sql.catalyst.util.DateTimeUtilsSuite You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-14321-1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13522.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13522 commit 602d4a70ba845df3160a07c2c9afe2d5c3c574c4 Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-06-06T12:54:02Z [SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date functions --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
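The cached-formatter pattern from the generated code above can be sketched as a stand-alone class. This is an illustrative version (the class name is hypothetical, not Spark code), assuming a single-threaded, per-task instance, since SimpleDateFormat is not thread-safe:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class CachedFormatter {
    // Lazily created and then reused, mirroring the sdf2 field that the
    // patched codegen emits instead of a new SimpleDateFormat per row.
    private SimpleDateFormat sdf;

    String format(long seconds) {
        if (sdf == null) {
            sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        }
        return sdf.format(new Date(seconds * 1000L));
    }
}
```

Each generated class holds its own formatter field and is used from a single task, which is presumably why the lack of thread safety in SimpleDateFormat is not a problem here.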
[GitHub] spark pull request: [SPARK-14321][SQL] Reduce date format cost and...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12105#issuecomment-222408035 Sorry about the delay in responding to this. Will try to rebase and post the patch asap. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-12998 [SQL]. Enable OrcRelation when con...
Github user rajeshbalamohan closed the pull request at: https://github.com/apache/spark/pull/10938 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14387][SQL] Exceptions thrown when quer...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12293#issuecomment-216082665 \cc @liancheng , @rxin - Can you please review when you find time? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-14113. Consider marking JobConf closure-...
Github user rajeshbalamohan closed the pull request at: https://github.com/apache/spark/pull/11978 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-14113. Consider marking JobConf closure-...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/11978#issuecomment-214705705 @srowen - With the master code base and the changes that went in (FileSourceStrategy, to be specific), this PR is no longer very relevant for master. It would be more relevant for the 1.6.x line, but I am not sure we need to backport it there. Please let me know; otherwise I will close this PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14521][SQL] StackOverflowError in Kryo ...
Github user rajeshbalamohan closed the pull request at: https://github.com/apache/spark/pull/12514 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14752][SQL] LazilyGenerateOrdering thro...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/12661 [SPARK-14752][SQL] LazilyGenerateOrdering throws NullPointerException ## What changes were proposed in this pull request? LazilyGenerateOrdering throws a NullPointerException when combined with TakeOrderedAndProjectExec, which causes simple queries like "select i_item_id from item order by i_item_id limit 10;" to fail in spark-sql. When deserializing in DirectTaskResult, Kryo walks the nested structure and hits an NPE on generatedOrdering. ## How was this patch tested? Manual testing by running multiple SQL queries in a multi-node cluster. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-14752 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12661.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12661 commit e83c9bc87acc794ca3e9a37c999c05550d425e2b Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-04-25T13:09:58Z [SPARK-14752][SQL] LazilyGenerateOrdering throws NullPointerException with TakeOrderedAndProject --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
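The "lazily generate" pattern this class is named after can be sketched as follows: the generated comparator is held in a transient field, so a deserialized copy rebuilds it on first use instead of dereferencing null. This is only an illustrative sketch under those assumptions, not Spark's actual LazilyGeneratedOrdering:

```java
import java.io.Serializable;
import java.util.Comparator;

public class LazyOrdering implements Serializable {
    // Not serialized: after deserialization this field is null again.
    private transient Comparator<Integer> generated;

    Comparator<Integer> get() {
        if (generated == null) {
            // Stand-in for the code-generated ordering being rebuilt lazily.
            generated = Integer::compare;
        }
        return generated;
    }

    int compare(int a, int b) {
        // Always goes through get(), so a freshly deserialized instance
        // never dereferences a null comparator.
        return get().compare(a, b);
    }
}
```

The NPE described above corresponds to a code path that reads the field directly instead of going through the lazy accessor after Kryo deserialization.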
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12319#issuecomment-214132553 Thanks @liancheng , @rxin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14387][SQL] Exceptions thrown when quer...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12293#issuecomment-213275126 \cc @liancheng --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14521][SQL] StackOverflowError in Kryo ...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12514#issuecomment-213215983 Sure @yzhou2001, please go ahead. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14387][SQL] Exceptions thrown when quer...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12293#issuecomment-213207842 Changes:
- Rebased patch to master branch
- Removed OrcTableScan as it is not used anywhere.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12319#issuecomment-212766510 Thanks for the review @liancheng Latest commit addresses the review comments. Changes are as follows:
- Moved OrcRecordReader changes to SparkOrcNewRecordReader in spark-hive
- Removed pom.xml related changes
- Fixed styling issues.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12319#issuecomment-212393980 Thanks for the review @liancheng . Should I create a separate PR for OrcRecordReader in https://github.com/pwendell/hive/tree/release-1.2.1-spark referencing this ticket? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14521][SQL] StackOverflowError in Kryo ...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12514#issuecomment-212392651 Sure, will check on removing the circular reference. Took the reference tracking approach, as it is enabled by default with Spark's KryoSerializer. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14521][SQL] StackOverflowError in Kryo ...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12514#issuecomment-212191093 \cc @JoshRosen --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14521][SQL] StackOverflowError in Kryo ...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/12514 [SPARK-14521][SQL] StackOverflowError in Kryo when executing TPC-DS Query27 ## What changes were proposed in this pull request? Observed a StackOverflowError in Kryo when executing TPC-DS Query27. The Spark thrift server disables Kryo reference tracking (if not specified in conf); when "spark.kryo.referenceTracking" is set to true explicitly in spark-defaults.conf, the query executes successfully. Recent changes in HashedRelation could have introduced cycles, which would require "spark.kryo.referenceTracking=true" in the spark-thrift server. This PR addresses this by setting referenceTracking to true in SparkSQLEnv. ## How was this patch tested? Manually running TPC-DS queries at 200 GB scale in a multi-node cluster. Also ran org.apache.spark.sql.hive.execution.HiveCompatibilitySuite, org.apache.spark.sql.hive.execution.HiveQuerySuite, org.apache.spark.sql.hive.execution.PruningSuite, org.apache.spark.sql.hive.CachedTableSuite You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-14521 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12514.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12514 commit 59875f424aaf60aa90ca5a1006df8b5e20d4f83a Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-04-20T00:59:32Z [SPARK-14521][SQL] StackOverflowError in Kryo when executing TPC-DS Query27
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
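Why reference tracking matters for cyclic graphs can be shown without Kryo at all: any recursive walk of an object graph must remember which objects it has already visited (by identity), or a cycle recurses forever and overflows the stack. A self-contained illustration (names are mine, not Kryo's API):

```java
import java.util.IdentityHashMap;

public class ReferenceTracking {
    static class Node { Node next; }

    // Counts nodes the way a reference-tracking serializer walks a graph:
    // each object is visited at most once, so a cycle cannot trigger
    // unbounded recursion (the analogue of the StackOverflowError above).
    static int countWithTracking(Node n, IdentityHashMap<Node, Boolean> seen) {
        if (n == null || seen.containsKey(n)) return 0;
        seen.put(n, true);
        return 1 + countWithTracking(n.next, seen);
    }

    public static void main(String[] args) {
        Node a = new Node(), b = new Node();
        a.next = b;
        b.next = a; // a cycle, like a back-reference inside HashedRelation
        System.out.println(countWithTracking(a, new IdentityHashMap<>())); // prints 2
    }
}
```

Dropping the `seen` map turns this into the untracked case: the recursion on the `a -> b -> a` cycle never terminates, which is what disabling "spark.kryo.referenceTracking" risks on cyclic data.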
[GitHub] spark pull request: SPARK-14321. [SQL] Reduce date format cost and...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12105#issuecomment-211286426 In the generated code, it returns null if constFormat == null. So it is not required to change the generated code. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-14113. Consider marking JobConf closure-...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/11978#issuecomment-211175800 @srowen - As per Andrew's comment, I thought it was fine to make the change given that HadoopRDD is marked as DeveloperAPI. Please let me know if any additional changes are needed. Additional info: SPARK-13664 brought a huge amount of change for FileSourceStrategy, which is now the default code path, so OrcRelation would no longer go via the HadoopRDD code path by default. Given that, this PR would have an impact only if someone invokes HadoopRDD directly and has done closure clearing upfront. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-14321. [SQL] Reduce date format cost and...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/12105#discussion_r60001514 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -368,7 +369,10 @@ abstract class UnixTime extends BinaryExpression with ExpectsInputTypes { t.asInstanceOf[Long] / 100L case StringType if right.foldable => if (constFormat != null) { -Try(new SimpleDateFormat(constFormat.toString).parse( +if (formatter == null) { + formatter = Try(new SimpleDateFormat(constFormat.toString)).getOrElse(null) --- End diff -- Didn't want to throw the error back, as it would break the earlier functionality: previously, it returned null when any exception (e.g. constFormat being null, or a parsing error) was thrown. The recent commit creates the formatter upfront and handles null early, to keep the changes minimal. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-14321. [SQL] Reduce date format cost and...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12105#issuecomment-210384204 Revised the patch addressing the comments. Fixed eval() of UnixTime and FromUnixTime. Haven't changed eval in DateFormatClass, as I am not sure whether the format can change in between. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-14321. [SQL] Reduce date format cost and...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12105#issuecomment-210202377 Sorry about the delay. I will share the updated patch today. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/12319#discussion_r59328799 --- Diff: sql/core/src/main/java/org/apache/hadoop/hive/ql/io/orc/OrcRecordReader.java --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hadoop.hive.ql.io.orc; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector; +import org.apache.hadoop.io.NullWritable; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +import java.io.IOException; +import java.util.List; + +public class OrcRecordReader extends RecordReader<NullWritable, OrcStruct> { --- End diff -- Sure. This is based on OrcNewInputFormat.OrcRecordReader (which is marked private). The only addition is getObjectInspector, targeted at reducing NameNode calls later. I will update the doc. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12319#issuecomment-208694717 Sure @rxin, makes sense. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-14551. [SQL] Reduce number of NN calls i...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/12319

SPARK-14551. [SQL] Reduce number of NN calls in OrcRelation with File…

## What changes were proposed in this pull request?

When FileSourceStrategy is used, a record reader is created, which incurs a NameNode call internally. Later, in OrcRelation.unwrapOrcStructs, it ends up reading the file information again to get the ObjectInspector, which incurs an additional NameNode call. It would be good to avoid this additional call, specifically for partitioned datasets.

Added OrcRecordReader, which is very similar to OrcNewInputFormat.OrcRecordReader but with an option of exposing the ObjectInspector. This eliminates the need to look up the file later for generating the object inspector, which would be specifically useful for partitioned tables/datasets.

## How was this patch tested?

Ran TPC-DS queries manually and also verified by running org.apache.spark.sql.hive.orc.OrcSuite, org.apache.spark.sql.hive.orc.OrcQuerySuite, org.apache.spark.sql.hive.orc.OrcPartitionDiscoverySuite, OrcHadoopFsRelationSuite, org.apache.spark.sql.hive.execution.HiveCompatibilitySuite

…SourceStrategy mode

You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-14551 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12319.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12319

commit 1b99d95e3361ed526a93abc6a3e1c93e6de578de
Author: Rajesh Balamohan <rbalamo...@apache.org>
Date: 2016-04-12T02:30:26Z

SPARK-14551. [SQL] Reduce number of NN calls in OrcRelation with FileSourceStrategy mode
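The caching idea in this PR can be sketched generically. The sketch below is an illustrative stand-in, with hypothetical names (it is not the actual OrcRecordReader code): derive the expensive metadata once when the reader is opened and expose it via a getter, so downstream code such as unwrapOrcStructs does not have to re-open the file, which is what costs the extra NameNode call.

```java
// Illustrative stand-in for the PR's idea (names are hypothetical, not the
// actual OrcRecordReader): read the expensive metadata once at open time and
// expose it, instead of re-opening the file later to fetch it.
class CachingReader {
    // Stand-in counter for file opens (each one costs a NameNode call).
    static int openCount = 0;

    private final String schema; // stands in for the ORC ObjectInspector

    private CachingReader(String schema) {
        this.schema = schema;
    }

    static CachingReader open(String path) {
        openCount++; // one "file open" per reader creation
        // Pretend we read the schema from the ORC footer while the file is open.
        return new CachingReader("struct<...>");
    }

    // Downstream code asks the reader rather than re-opening the file.
    String getObjectInspector() {
        return schema;
    }
}
```

With the metadata cached at open time, asking for the inspector any number of times still results in a single file open.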
[GitHub] spark pull request: SPARK-14387. [SQL] Exceptions thrown when quer...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/12293

SPARK-14387. [SQL] Exceptions thrown when querying ORC tables stored …

## What changes were proposed in this pull request?

Physical files stored in Hive as ORC have internal column names like _col1, _col2, etc., and the column mapping is available in the HiveMetastore. It was possible to query ORC tables stored in Hive via Spark's beeline client in earlier branches, but with the master branch this was broken. When reading ORC files, it would be good to map the Hive schema to the physical schema to support backward compatibility. This PR addresses that issue.

## How was this patch tested?

Manual execution of TPC-DS queries at 200 GB scale.

…in Hive

You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-14387 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12293.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12293

commit 1bc4e98ff19e76a2302003268c7a2c374647aad3
Author: Rajesh Balamohan <rbalamo...@apache.org>
Date: 2016-04-11T05:53:24Z

SPARK-14387. [SQL] Exceptions thrown when querying ORC tables stored in Hive
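The column-mapping idea can be sketched as follows; this is a hypothetical illustration, not code from the patch. Hive-written ORC files carry internal names (_col0, _col1, ...), while the logical names come from the metastore schema, matched ordinal by ordinal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not the patch itself): translate ORC's internal
// physical column names (_col0, _col1, ...) back to the logical names the
// Hive metastore records for the table, matching by ordinal.
class OrcColumnMapper {
    static List<String> mapPhysicalToLogical(List<String> physicalNames,
                                             List<String> metastoreNames) {
        List<String> resolved = new ArrayList<>();
        for (String physical : physicalNames) {
            if (physical.startsWith("_col")) {
                // _colN maps to the N-th column of the metastore schema.
                int ordinal = Integer.parseInt(physical.substring("_col".length()));
                resolved.add(metastoreNames.get(ordinal));
            } else {
                // Files already written with real names need no translation.
                resolved.add(physical);
            }
        }
        return resolved;
    }
}
```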
[GitHub] spark pull request: SPARK-14321. [SQL] Reduce date format cost and...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12105#issuecomment-205122837 Agreed. Thanks @srowen . Reverted calendar changes in DateTimeUtils in recent commit.
[GitHub] spark pull request: SPARK-14321. [SQL] Reduce date format cost and...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12105#issuecomment-205082821 The SimpleDateFormat declared in the generated code is not shared across multiple threads.
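For context, SimpleDateFormat is not thread-safe, which is why sharing matters here: a per-object instance is only safe because the generated code never hands it to multiple threads. If an instance ever did need to be shared, one common pattern (not what this patch does) is a ThreadLocal holding a per-thread formatter:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch for context (not from the patch): SimpleDateFormat is not
// thread-safe, so if it had to be shared across threads, a ThreadLocal
// would give each thread its own instance.
class SafeFormatter {
    private static final ThreadLocal<SimpleDateFormat> FMT =
        ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"));

    static String format(long epochSeconds) {
        return FMT.get().format(new Date(epochSeconds * 1000L));
    }
}
```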
[GitHub] spark pull request: SPARK-14113. Consider marking JobConf closure-...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/11978#issuecomment-204307805

@andrewor14 - Not sure if I understood your last comment. Currently no direct invocation of HadoopRDD (with initLocalJobConfFuncOpt) is made in Spark. At a later point in time, if a change is needed to invoke HadoopRDD (with initLocalJobConfFuncOpt) via SparkContext, the following method could be added, which cleans up the function.

```
def hadoopRDD[K, V](
    broadcastedConf: Broadcast[SerializableConfiguration],
    initLocalJobConfFuncOpt: Option[JobConf => Unit],
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  assertNotStopped()
  clean(initLocalJobConfFuncOpt)
  new HadoopRDD(this, broadcastedConf, initLocalJobConfFuncOpt, inputFormatClass,
    keyClass, valueClass, minPartitions)
}
```

But I am not sure whether we need to clean sc.hadoopRDD in this patch. Please let me know.
[GitHub] spark pull request: SPARK-14321. [SQL] Reduce date format cost and...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/12105

SPARK-14321. [SQL] Reduce date format cost and string-to-date cost i…

## What changes were proposed in this pull request?

Here is the generated code snippet when executing date functions. SimpleDateFormat is fairly expensive and can show up as a bottleneck when processing millions of records. It would be better to instantiate it once.

```
/* 066 */ UTF8String primitive5 = null;
/* 067 */ if (!isNull4) {
/* 068 */   try {
/* 069 */     primitive5 = UTF8String.fromString(new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(
/* 070 */       new java.util.Date(primitive7 * 1000L)));
/* 071 */   } catch (java.lang.Throwable e) {
/* 072 */     isNull4 = true;
/* 073 */   }
/* 074 */ }
```

With the modified code, here is the generated code:

```
/* 010 */ private java.text.SimpleDateFormat sdf2;
/* 011 */ private UnsafeRow result13;
/* 012 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder bufferHolder14;
/* 013 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter15;
/* 014 */ ...
...
/* 065 */ boolean isNull0 = isNull3;
/* 066 */ UTF8String primitive1 = null;
/* 067 */ if (!isNull0) {
/* 068 */   try {
/* 069 */     if (sdf2 == null) {
/* 070 */       sdf2 = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
/* 071 */     }
/* 072 */     primitive1 = UTF8String.fromString(sdf2.format(
/* 073 */       new java.util.Date(primitive4 * 1000L)));
/* 074 */   } catch (java.lang.Throwable e) {
/* 075 */     isNull0 = true;
/* 076 */   }
/* 077 */ }
```

Similarly, Calendar.getInstance was used in DateTimeUtils, which can be lazily initialized.

## How was this patch tested?

org.apache.spark.sql.catalyst.expressions.DateExpressionsSuite, org.apache.spark.sql.catalyst.util.DateTimeUtilsSuite

Also tried with a couple of sample SQL queries with a single executor (6GB), which showed good improvement with the fix.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-14321 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12105.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12105

commit 6fd07db11b5c9eed795dde11177f1c245a6fef16
Author: Rajesh Balamohan <rbalamo...@apache.org>
Date: 2016-04-01T02:41:07Z

SPARK-14321. [SQL] Reduce date format cost and string-to-date cost in date functions
[GitHub] spark pull request: SPARK-14113. Consider marking JobConf closure-...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/11978#issuecomment-204200260 Thanks @andrewor14 . Addressed your review comments in latest commit.
[GitHub] spark pull request: SPARK-14113. Consider marking JobConf closure-...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/11978#discussion_r57537799

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -979,6 +979,7 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
 // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
 val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
 val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
+clean(setInputPathsFunc)
--- End diff --

Thanks @srowen. Yes, for invocations via sc.textFile. Adding an additional method like the following and passing initLocalJobConfFuncOpt to it can help avoid closure cleaning in this scenario. However, this would call for changes in all other places where sc.textFile is invoked. The intention was to allow users to make use of HadoopRDD directly (if needed) without having to incur the cost of closure cleaning (e.g. in sql modules). Hence did not make those additional changes.

```
def newTextFile(
    path: String,
    initLocalJobConfFuncOpt: Option[JobConf => Unit],
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], initLocalJobConfFuncOpt,
    classOf[LongWritable], classOf[Text], minPartitions)
    .map(pair => pair._2.toString).setName(path)
}

def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    initLocalJobConfFuncOpt: Option[JobConf => Unit],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  assertNotStopped()
  // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
  val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
  new HadoopRDD(
    this, confBroadcast, initLocalJobConfFuncOpt, inputFormatClass,
    keyClass, valueClass, minPartitions).setName(path)
}

// e.g. sc.newTextFile(tmpFilePath, Some(setInputPathsFunc), 4).count()
```
[GitHub] spark pull request: SPARK-14113. Consider marking JobConf closure-...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/11978#issuecomment-202048734 Tested with the following suites, along with the earlier SQL suites: org.apache.spark.FileSuite org.apache.spark.SparkContextSuite org.apache.spark.graphx.GraphLoaderSuite org.apache.spark.graphx.lib.SVDPlusPlusSuite org.apache.spark.metrics.InputOutputMetricsSuite org.apache.spark.ml.PipelineSuite org.apache.spark.ml.classification.DecisionTreeClassifierSuite org.apache.spark.ml.classification.LogisticRegressionSuite org.apache.spark.ml.classification.MultilayerPerceptronClassifierSuite org.apache.spark.ml.classification.NaiveBayesSuite org.apache.spark.ml.clustering.KMeansSuite org.apache.spark.ml.clustering.LDASuite org.apache.spark.ml.evaluation.BinaryClassificationEvaluatorSuite org.apache.spark.ml.evaluation.MulticlassClassificationEvaluatorSuite org.apache.spark.ml.evaluation.RegressionEvaluatorSuite org.apache.spark.ml.feature.BinarizerSuite org.apache.spark.ml.feature.BucketizerSuite org.apache.spark.ml.feature.ChiSqSelectorSuite org.apache.spark.ml.feature.CountVectorizerSuite org.apache.spark.ml.feature.DCTSuite org.apache.spark.ml.feature.ElementwiseProductSuite org.apache.spark.ml.feature.HashingTFSuite org.apache.spark.ml.feature.IDFSuite org.apache.spark.ml.feature.InteractionSuite org.apache.spark.ml.feature.MaxAbsScalerSuite org.apache.spark.ml.feature.MinMaxScalerSuite org.apache.spark.ml.feature.NGramSuite org.apache.spark.ml.feature.NormalizerSuite org.apache.spark.ml.feature.OneHotEncoderSuite org.apache.spark.ml.feature.PCASuite org.apache.spark.ml.feature.PolynomialExpansionSuite org.apache.spark.ml.feature.QuantileDiscretizerSuite org.apache.spark.ml.feature.RFormulaSuite org.apache.spark.ml.feature.RegexTokenizerSuite org.apache.spark.ml.feature.SQLTransformerSuite org.apache.spark.ml.feature.StandardScalerSuite org.apache.spark.ml.feature.StopWordsRemoverSuite org.apache.spark.ml.feature.StringIndexerSuite org.apache.spark.ml.feature.TokenizerSuite org.apache.spark.ml.feature.VectorAssemblerSuite org.apache.spark.ml.feature.VectorIndexerSuite org.apache.spark.ml.feature.VectorSlicerSuite org.apache.spark.ml.feature.Word2VecSuite org.apache.spark.ml.recommendation.ALSSuite org.apache.spark.ml.regression.AFTSurvivalRegressionSuite org.apache.spark.ml.regression.DecisionTreeRegressorSuite org.apache.spark.ml.regression.GeneralizedLinearRegressionSuite org.apache.spark.ml.regression.IsotonicRegressionSuite org.apache.spark.ml.regression.LinearRegressionSuite org.apache.spark.ml.source.libsvm.LibSVMRelationSuite org.apache.spark.ml.tuning.CrossValidatorSuite org.apache.spark.ml.util.DefaultReadWriteSuite org.apache.spark.mllib.classification.LogisticRegressionSuite org.apache.spark.mllib.classification.NaiveBayesSuite org.apache.spark.mllib.classification.SVMSuite org.apache.spark.mllib.clustering.GaussianMixtureSuite org.apache.spark.mllib.clustering.KMeansSuite org.apache.spark.mllib.clustering.LDASuite org.apache.spark.mllib.clustering.PowerIterationClusteringSuite org.apache.spark.mllib.feature.ChiSqSelectorSuite org.apache.spark.mllib.feature.Word2VecSuite org.apache.spark.mllib.fpm.FPGrowthSuite org.apache.spark.mllib.recommendation.MatrixFactorizationModelSuite org.apache.spark.mllib.regression.IsotonicRegressionSuite org.apache.spark.mllib.regression.LassoSuite org.apache.spark.mllib.regression.LinearRegressionSuite org.apache.spark.mllib.regression.RidgeRegressionSuite org.apache.spark.mllib.tree.DecisionTreeSuite org.apache.spark.mllib.tree.GradientBoostedTreesSuite org.apache.spark.mllib.tree.RandomForestSuite org.apache.spark.mllib.util.MLUtilsSuite org.apache.spark.rdd.HadoopRDD, org.apache.spark.rdd.MapPartitionsRDD, org.apache.spark.rdd.PairRDDFunctionsSuite org.apache.spark.repl.ReplSuite org.apache.spark.sql.execution.datasources.csv.CSVSuite org.apache.spark.sql.execution.datasources.json.JsonSuite
[GitHub] spark pull request: SPARK-14113. Consider marking JobConf closure-...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/11978

SPARK-14113. Consider marking JobConf closure-cleaning in HadoopRDD a…

## What changes were proposed in this pull request?

In HadoopRDD, the following code was introduced as a part of SPARK-6943.

```
if (initLocalJobConfFuncOpt.isDefined) {
  sparkContext.clean(initLocalJobConfFuncOpt.get)
}
```

Passing initLocalJobConfFuncOpt to HadoopRDD incurs a significant performance penalty (due to closure cleaning) with a large number of RDDs. This would be invoked for every HadoopRDD initialization, causing the bottleneck. An example thread stack is given below:

```
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.readUTF8(Unknown Source)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:402)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:102)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:390)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$15.apply(ClosureCleaner.scala:224)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$15.apply(ClosureCleaner.scala:223)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:223)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2079)
at org.apache.spark.rdd.HadoopRDD.<init>(HadoopRDD.scala:112)
```

This PR does the following:
1. Removes the closure cleaning in HadoopRDD init, which was mainly added to check whether HadoopRDD can be made serializable or not.
2. Directly instantiates HadoopRDD in OrcRelation, instead of going via SparkContext.hadoopRDD (which internally invokes a thread dump in "withScope"). Clubbing this change here instead of making a separate ticket for this minor change.

## How was this patch tested?

No new tests have been added. Used the following code to measure the overhead of the HadoopRDD init codepath. With the patch it took 340ms, as opposed to 4815ms without the patch. Also tested with a number of queries from TPC-DS in a multi-node environment.

Along with that, ran the following unit tests: org.apache.spark.sql.hive.execution.HiveCompatibilitySuite, org.apache.spark.sql.hive.execution.HiveQuerySuite, org.apache.spark.sql.hive.execution.PruningSuite, org.apache.spark.sql.hive.CachedTableSuite, org.apache.spark.rdd.RDDOperationScopeSuite, org.apache.spark.ui.jobs.JobProgressListenerSuite

```
test("Check timing for HadoopRDD init") {
  val start: Long = System.currentTimeMillis();
  val initializeJobConfFunc = HadoopTableReader.initializeLocalJobConfFunc("", null) _
  Utils.withDummyCallSite(sqlContext.sparkContext) {
    // Large tables end up creating 5500 RDDs
    for (i <- 1 to 5500) {
      // ignore nulls in RDD as its mainly for testing timing of RDD creation
      val testRDD = new HadoopRDD(sqlContext.sparkContext, null, Some(initializeJobConfFunc),
        null, classOf[NullWritable], classOf[Writable], 10)
    }
  }
  val end: Long = System.currentTimeMillis();
  println("Time taken : " + (end - start))
}
```

Without Patch: (time taken to init 5000 HadoopRDD)
[GitHub] spark pull request: SPARK-14091 [core] Consider improving performa...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/11911#issuecomment-200552367 Thanks @JoshRosen and @srowen . Retested with "lazy val" which has the same perf improvement. Added "lazy val" in latest commit.
[GitHub] spark pull request: SPARK-14091 [core] Consider improving performa...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/11911#discussion_r57150734

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1745,11 +1745,16 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  * has overridden the call site using `setCallSite()`, this will return the user's version.
  */
 private[spark] def getCallSite(): CallSite = {
-  val callSite = Utils.getCallSite()
-  CallSite(
-    Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
-    Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
-  )
+  var (shortForm, longForm) = (getLocalProperty(CallSite.SHORT_FORM),
+    getLocalProperty(CallSite.LONG_FORM))
+
+  if (shortForm == null || longForm == null) {
+    val callSite = Utils.getCallSite()
+    shortForm = callSite.shortForm
--- End diff --

Thanks @srowen. In Utils.withDummyCallSite(), both LONG_FORM and SHORT_FORM are explicitly set to "". But I can see that it is possible to explicitly set one of them via setCallSite(shortCallSite). Incorporated your review comments in the latest commit.
[GitHub] spark pull request: SPARK-14091 [core] Consider improving performa...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/11911#discussion_r57148521

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1745,10 +1745,11 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  * has overridden the call site using `setCallSite()`, this will return the user's version.
  */
 private[spark] def getCallSite(): CallSite = {
-  val callSite = Utils.getCallSite()
   CallSite(
-    Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
-    Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
+    Option(getLocalProperty(CallSite.SHORT_FORM))
--- End diff --

Thanks @srowen . Incorporated the review comments in the latest commit. Please review.
[GitHub] spark pull request: SPARK-14091 [core] Consider improving performa...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/11911

SPARK-14091 [core] Consider improving performance of SparkContext.get…

## What changes were proposed in this pull request?

Currently SparkContext.getCallSite() makes a call to Utils.getCallSite().

```
private[spark] def getCallSite(): CallSite = {
  val callSite = Utils.getCallSite()
  CallSite(
    Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
    Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
  )
}
```

However, in some places Utils.withDummyCallSite(sc) is invoked to avoid expensive thread dumps within getCallSite(). But Utils.getCallSite() is evaluated eagerly, causing the thread dumps to be computed anyway. This has an impact when lots of RDDs are created (e.g. it spends close to 3-7 seconds when 1000+ RDDs are present, which can be significant when the entire query runtime is on the order of 10-20 seconds). Creating this JIRA to consider evaluating getCallSite only when needed.

## How was this patch tested?

No new test cases are added. The following standalone test was tried out manually. Also built the entire Spark binary and tried with a few SQL queries from TPC-DS and TPC-H in a multi-node cluster.

```
def run(): Unit = {
  val conf = new SparkConf()
  val sc = new SparkContext("local[1]", "test-context", conf)
  val start: Long = System.currentTimeMillis();
  val confBroadcast = sc.broadcast(new SerializableConfiguration(new Configuration()))
  Utils.withDummyCallSite(sc) {
    // Large tables end up creating 5500 RDDs
    for (i <- 1 to 5000) {
      val testRDD = new HadoopRDD(sc, confBroadcast, None, null,
        classOf[NullWritable], classOf[Writable], 10)
    }
  }
  val end: Long = System.currentTimeMillis();
  println("Time taken : " + (end - start))
}

def main(args: Array[String]): Unit = {
  run
}
```

…CallSite() (rbalamohan)

You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-14091 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11911.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11911

commit dba630b854d6fdb298f8ef7ed25acf497f0eeebe
Author: Rajesh Balamohan <rbalamo...@apache.org>
Date: 2016-03-23T04:57:01Z

SPARK-14091 [core] Consider improving performance of SparkContext.getCallSite() (rbalamohan)
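The fix amounts to deferring the expensive Utils.getCallSite() (a thread dump) until it is actually needed. Scala's `lazy val` provides this directly; the hypothetical Java wrapper below sketches the same deferred, memoized evaluation (it is an illustration, not Spark code):

```java
import java.util.function.Supplier;

// Sketch of the deferred-evaluation idea behind the fix: memoize an
// expensive computation so it runs at most once, and only if requested.
class Lazy<T> {
    private Supplier<T> supplier;
    private T value;
    private boolean computed = false;

    Lazy(Supplier<T> supplier) {
        this.supplier = supplier;
    }

    synchronized T get() {
        if (!computed) {
            value = supplier.get();
            computed = true;
            supplier = null; // allow the closure to be garbage-collected
        }
        return value;
    }
}
```

When the local SHORT_FORM/LONG_FORM properties are already set (as withDummyCallSite does), the deferred computation is simply never triggered.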
[GitHub] spark pull request: SPARK-12925. Improve HiveInspectors.unwrap for...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/11477#issuecomment-192146953 Thanks @srowen . Incorporated the changes. This was tested with HiveCompatibilitySuite and HiveQuerySuite. These tests ran fine on the master branch without the changes as well. However, when tried with the 1.6 branch, these test suites failed with the copy issues. Hence doing an explicit byte copy in master, so that this does not fail in future.
[GitHub] spark pull request: SPARK-12925. Improve HiveInspectors.unwrap for...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/11477

SPARK-12925. Improve HiveInspectors.unwrap for StringObjectInspector.…

The earlier fix did not copy the bytes, and it is possible for a higher level to reuse the Text object, which was causing issues. The proposed fix now copies the bytes from Text. This still avoids the expensive encoding/decoding.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-12925.2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11477.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11477

commit d46d41ea75ecfeaef208f6c54222f23c24ebd3b0
Author: Rajesh Balamohan <rbalamo...@apache.org>
Date: 2016-03-02T23:39:21Z

SPARK-12925. Improve HiveInspectors.unwrap for StringObjectInspector.getPrimitiveWritableObject. Earlier fix did not copy the bytes and it is possible for higher level to reuse Text object. This was causing issues. Proposed fix now copies the bytes from Text. This still avoids the expensive encoding/decoding
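The hazard being fixed is object reuse: Hadoop record readers reuse the same Text instance across records, so keeping a reference to its backing bytes means later reads overwrite earlier values. The stand-in below uses plain arrays instead of Hadoop's Text (to stay self-contained); the class and method names are hypothetical, but the aliasing-versus-copying contrast is the point of the fix:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Minimal stand-in for the reuse hazard the fix addresses: a record reader
// reuses one buffer across records, so callers must copy the bytes they keep.
class ReuseDemo {
    // Simulated reusable buffer, like the single Text a record reader holds.
    static byte[] buffer = new byte[16];
    static int length = 0;

    static void readNext(String value) {
        byte[] b = value.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(b, 0, buffer, 0, b.length);
        length = b.length;
    }

    // Unsafe: aliases the shared buffer, so a later read clobbers the value.
    static byte[] unsafeCapture() {
        return buffer;
    }

    // Safe: copies exactly `length` bytes, as the fix does for Text.
    static byte[] safeCapture() {
        return Arrays.copyOf(buffer, length);
    }
}
```

Copying only the valid `length` bytes keeps the cheap raw-bytes path (no string encoding/decoding) while making the captured value immune to buffer reuse.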
[GitHub] spark pull request: SPARK-12417. [SQL] Orc bloom filter options ar...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/10842#issuecomment-185519589 Closed #10375 which was the dup of this pull request. Review can be done on this pull request. Thanks @JoshRosen
[GitHub] spark pull request: SPARK-12417. [SQL] Orc bloom filter options ar...
Github user rajeshbalamohan closed the pull request at: https://github.com/apache/spark/pull/10375
[GitHub] spark pull request: SPARK-12417. [SQL] Orc bloom filter options ar...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/10375#issuecomment-185519219 Closing this as dup of #10842
[GitHub] spark pull request: SPARK-12417. [SQL] Orc bloom filter options ar...
GitHub user rajeshbalamohan reopened a pull request: https://github.com/apache/spark/pull/10842 SPARK-12417. [SQL] Orc bloom filter options are not propagated during… Add option to have bloom filters in ORC write codepath. Added changes to apply cleanly in master branch. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-12417 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10842.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10842 commit 8615764056bc9039933ca97d85564cf60097fb5a Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-01-19T04:16:28Z SPARK-12417. [SQL] Orc bloom filter options are not propagated during file
[GitHub] spark pull request: SPARK-12417. [SQL] Orc bloom filter options ar...
Github user rajeshbalamohan closed the pull request at: https://github.com/apache/spark/pull/10842
[GitHub] spark pull request: SPARK-12920. [SQL]. Spark thrift server can ru...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/10846#issuecomment-184185874 @JoshRosen - Could you please let me know how to proceed with this patch? It reduces the CPU utilization of the spark-thrift server in a multi-user environment.
[GitHub] spark pull request: SPARK-12948. [SQL]. Consider reducing size of ...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/10861#issuecomment-175584028 @JoshRosen - Please let me know if my latest comment on the usecase addresses your question. >> may be worth a holistic design review because I think there are some hacks in Spark SQL to address this problem there and it would be nice to have a unified solution for this Can you please provide more details/pointers on this?
[GitHub] spark pull request: SPARK-12998 [SQL]. Enable OrcRelation when con...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/10938 SPARK-12998 [SQL]. Enable OrcRelation when connecting via spark thrif… When a user connects via the spark-thrift server to execute SQL, PPD is not enabled with ORC: a MetastoreRelation, which does not support ORC PPD, ends up being created. The purpose of this JIRA is to convert MetastoreRelation to OrcRelation in HiveMetastoreCatalog, so that users can benefit from PPD even when connecting to the spark-thrift server. For example, the query "select count(1) from tpch_flat_orc_1000.lineitem where l_shipdate = '1990-04-18'", fired against the spark-thrift server or sqlContext, would end up using OrcRelation to make use of PPD instead of MetastoreRelation. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark spark-12998 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10938.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10938 commit 1a5b164153df946d713b34727a001d5005479d1a Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-01-27T02:19:53Z SPARK-12998 [SQL]. Enable OrcRelation when connecting via spark thrift server.
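The conversion described above can be sketched as a routing decision on the table's storage format. This uses hypothetical types, not Spark's actual classes, purely to illustrate why the catalog-level conversion matters for predicate pushdown:

```java
// Sketch (hypothetical types): route ORC-backed tables to a relation that
// supports predicate pushdown (PPD), instead of the generic metastore
// relation that does not.
public class CatalogDemo {
    enum Relation {
        METASTORE_RELATION(false),  // generic path: no ORC PPD
        ORC_RELATION(true);         // ORC-aware path: PPD available

        final boolean supportsPpd;
        Relation(boolean supportsPpd) { this.supportsPpd = supportsPpd; }
    }

    // Mirrors the idea of HiveMetastoreCatalog converting MetastoreRelation
    // to OrcRelation when the table's storage format is ORC.
    static Relation resolve(String storageFormat) {
        return "orc".equalsIgnoreCase(storageFormat)
                ? Relation.ORC_RELATION
                : Relation.METASTORE_RELATION;
    }

    public static void main(String[] args) {
        System.out.println(resolve("ORC"));   // ORC_RELATION
        System.out.println(resolve("text"));  // METASTORE_RELATION
    }
}
```

With the PPD-capable relation selected, a filter like `l_shipdate = '1990-04-18'` can be evaluated against ORC stripe/row-group statistics instead of after a full scan.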
[GitHub] spark pull request: SPARK-12948. [SQL]. Consider reducing size of ...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/10861#issuecomment-174355176 **Usecase**: The user maps a partitioned dataset (e.g., the TPC-DS dataset at 200 GB scale) and runs a query in spark-shell. E.g ... val o_store_sales = sqlContext.read.format("orc").load("/tmp/spark_tpcds_bin_partitioned_orc_200/store_sales") o_store_sales.registerTempTable("o_store_sales") .. sqlContext.sql("SELECT..").show(); ... When this is executed, OrcRelation creates Config objects for every partition (Ref: [OrcRelation.execute()](https://github.com/apache/spark/blob/e14817b528ccab4b4685b45a95e2325630b5fc53/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala#L295)). In the case of TPC-DS, it generates 1826 partitions. This info is broadcasted in [DAGScheduler#submitMissingTasks()](https://github.com/apache/spark/blob/1b2c2162af4d5d2d950af94571e69273b49bf913/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1010). As part of this, the configurations created for the 1826 partitions are also streamed through (i.e., embedded in HadoopMapPartitionsWithSplitRDD --> f() --> wrappedConf). Each of these configurations takes around 251 KB per partition. Please refer to the profiler snapshot attached in the JIRA ([mem_snap_shot](https://issues.apache.org/jira/secure/attachment/12784080/SPARK-12948.mem.prof.snapshot.png)). This causes quite a bit of delay in the overall job runtime. The patch reuses the already broadcasted conf from SparkContext. The fillObject() function is executed later for every partition, which internally sets up any additional config details. This drastically reduces the amount of payload that is broadcasted and helps in reducing the overall job runtime.
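The size argument above can be illustrated with plain Java serialization: shipping one shared conf plus small per-partition descriptors is far smaller than embedding a full conf copy per partition. This is a sketch with made-up entry counts, not Spark's actual payloads:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;

// Sketch: one shared conf vs. a duplicated conf per partition.
public class BroadcastSizeDemo {
    static int serializedSize(Serializable o) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(o);
            }
            return bos.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fresh strings each call, so serialization cannot share references across copies.
    static HashMap<String, String> newConf(int entries) {
        HashMap<String, String> conf = new HashMap<>();
        for (int i = 0; i < entries; i++) {
            conf.put("some.hadoop.key." + i, "some.value." + i);
        }
        return conf;
    }

    public static void main(String[] args) {
        int partitions = 200, entries = 200;
        // Per-partition copies: each task payload drags its own conf along.
        ArrayList<HashMap<String, String>> perPartition = new ArrayList<>();
        for (int p = 0; p < partitions; p++) perPartition.add(newConf(entries));
        // Shared: one conf plus tiny per-partition split descriptors.
        ArrayList<String> splits = new ArrayList<>();
        for (int p = 0; p < partitions; p++) splits.add("split-" + p);
        int duplicated = serializedSize(perPartition);
        int shared = serializedSize(newConf(entries)) + serializedSize(splits);
        System.out.println("duplicated=" + duplicated + " bytes, shared=" + shared + " bytes");
    }
}
```

At ~251 KB per partition and 1826 partitions, the duplicated form is on the order of hundreds of megabytes, which is the delay the profiler snapshot shows.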
[GitHub] spark pull request: SPARK-12920. [SQL]. Spark thrift server can ru...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/10846#issuecomment-173741417 Thanks @JoshRosen . The current patch is based on a flagging approach (in case of retaining caching) which would be safe for 1.6.x.
[GitHub] spark pull request: SPARK-12948. [SQL]. Consider reducing size of ...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/10861 SPARK-12948. [SQL]. Consider reducing size of broadcasts in OrcRelation The size of broadcasted data in OrcRelation was significantly higher when running queries with a large number of partitions (e.g., TPC-DS), and it has an impact on the job runtime. This is more evident when there is a large number of partitions/splits. A profiler snapshot is attached in SPARK-12948 (https://issues.apache.org/jira/secure/attachment/12783513/SPARK-12948_cpuProf.png). You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-12948 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10861.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10861 commit 4da7a22d2195c77e27aa4f3aa957b1fdc0d57f5b Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-01-21T05:09:09Z SPARK-12948. [SQL]. Consider reducing size of broadcasts in OrcRelation
[GitHub] spark pull request: SPARK-12920. [SQL]. Spark thrift server can ru...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/10846 SPARK-12920. [SQL]. Spark thrift server can run at very high CPU with… The Spark thrift server runs at very high CPU when concurrent users submit queries to the system over a period of time. Profiling revealed this is due to many Conf objects getting cached in HadoopRDD, which causes lots of GC pressure when running queries on datasets with a large number of partitions. Also, job UI retention causes issues with large jobs. Details are mentioned in SPARK-12920 and profiler snapshots are attached. The fix introduces "spark.hadoop.cacheConf" to optionally cache the jobConf in HadoopRDD. The JobProgressListener fixes are related to trimming the job/stage details. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-12920 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10846.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10846 commit 12d77beb0c6c4e3f336aab43a5907775da715753 Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-01-20T09:14:36Z SPARK-12920. [SQL]. Spark thrift server can run at very high CPU with concurrent users
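The flag-guarded caching idea can be sketched as follows. The class and method names here are assumptions for illustration (the real change lives in HadoopRDD, keyed off the proposed "spark.hadoop.cacheConf" property):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch (assumed names, not the actual patch): an opt-in cache so that
// conf objects are only retained when the deployment wants them retained.
public class ConfCache {
    private final boolean cacheEnabled;
    private final Map<String, Map<String, String>> cache = new HashMap<>();
    private int builds = 0;

    public ConfCache(boolean cacheEnabled) { this.cacheEnabled = cacheEnabled; }

    // Stand-in for an expensive JobConf construction.
    private Map<String, String> buildConf(String jobId) {
        builds++;
        Map<String, String> conf = new HashMap<>();
        conf.put("job.id", jobId);
        return conf;
    }

    public Map<String, String> getConf(String jobId) {
        if (cacheEnabled) {
            return cache.computeIfAbsent(jobId, this::buildConf);
        }
        // Caching off: rebuild each time; nothing is retained, so long-lived
        // servers with many concurrent users avoid the accumulated GC pressure.
        return buildConf(jobId);
    }

    public int buildCount() { return builds; }
}
```

The trade-off is construction cost per access versus heap retained across many queries; the flag lets the operator pick per deployment.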
[GitHub] spark pull request: SPARK-12925. [SQL]. Improve HiveInspectors.unw...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/10848 SPARK-12925. [SQL]. Improve HiveInspectors.unwrap for StringObjectIns… Text is in UTF-8, and converting it via "UTF8String.fromString" incurs decoding and encoding, which turns out to be expensive and redundant. Profiler snapshot details are attached in the JIRA (ref: https://issues.apache.org/jira/secure/attachment/12783331/SPARK-12925_profiler_cpu_samples.png) You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-12925 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10848.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10848 commit 30cc93246828da9728891f9ed6d65b26bcb888af Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-01-20T12:15:56Z SPARK-12925. [SQL]. Improve HiveInspectors.unwrap for StringObjectInspector.getPrimitiveWritableObject
[GitHub] spark pull request: SPARK-12898. Consider having dummyCallSite for...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/10825#issuecomment-173037438 getCallSite gets the thread stack trace (plus additional processing). This is executed numerous times when running a query on the TPC-DS dataset (with 1800+ partition files). I have attached the profiler info in SPARK-12898 (https://issues.apache.org/jira/secure/attachment/12783232/callsiteProf.png). Having dummyCallSite eliminates these stack-trace calls and reduces the overall job runtime.
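The cost difference is easy to see in isolation: capturing a stack trace walks and materializes every frame on each call, while a precomputed call site is a constant. A self-contained sketch (the `CallSite` class here is a simplified stand-in, not Spark's `org.apache.spark.util.CallSite`):

```java
// Sketch: a real stack walk per call vs. a constant dummy call site.
public class CallSiteDemo {
    static final class CallSite {
        final String shortForm;
        final String longForm;
        CallSite(String shortForm, String longForm) {
            this.shortForm = shortForm;
            this.longForm = longForm;
        }
    }

    // Roughly what getCallSite pays for: a full stack walk on every invocation.
    static CallSite realCallSite() {
        StackTraceElement[] stack = Thread.currentThread().getStackTrace();
        StringBuilder longForm = new StringBuilder();
        for (StackTraceElement e : stack) longForm.append(e).append('\n');
        String top = stack.length > 0 ? stack[0].toString() : "<unknown>";
        return new CallSite(top, longForm.toString());
    }

    // Computed once; reused for every partition scanned.
    static final CallSite DUMMY = new CallSite("HiveTableScan", "HiveTableScan");

    public static void main(String[] args) {
        int n = 10_000;
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) realCallSite();
        long walk = System.nanoTime() - t0;
        // The dummy path is a constant field read; its cost is negligible.
        System.out.println("stack walks for " + n + " calls took " + walk / 1_000_000 + " ms");
    }
}
```

With 1800+ partitions the walk happens per partition setup, which is why it shows up in the CPU profile.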
[GitHub] spark pull request: SPARK-12417. [SQL] Orc bloom filter options ar...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/10842 SPARK-12417. [SQL] Orc bloom filter options are not propagated during… Add option to have bloom filters in ORC write codepath. Added changes to apply cleanly in master branch. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-12417 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10842.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10842 commit 8615764056bc9039933ca97d85564cf60097fb5a Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-01-19T04:16:28Z SPARK-12417. [SQL] Orc bloom filter options are not propagated during file
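The shape of the fix is to carry the user-supplied bloom-filter options through to the writer configuration instead of dropping them. A sketch with real ORC property names but simplified plumbing (the `writerConf` helper is hypothetical, not Spark's actual code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: propagate bloom-filter options from the caller's options map
// into the writer configuration instead of silently dropping them.
public class OrcOptionDemo {
    static final List<String> BLOOM_KEYS = List.of(
            "orc.bloom.filter.columns",
            "orc.bloom.filter.fpp");

    static Map<String, String> writerConf(Map<String, String> base,
                                          Map<String, String> userOptions) {
        Map<String, String> conf = new HashMap<>(base);
        for (String key : BLOOM_KEYS) {
            if (userOptions.containsKey(key)) {
                conf.put(key, userOptions.get(key));  // carry the option through to the writer
            }
        }
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = writerConf(
                Map.of("orc.compress", "ZLIB"),
                Map.of("orc.bloom.filter.columns", "l_shipdate", "unrelated", "x"));
        System.out.println(conf);
    }
}
```

With the keys propagated, ORC writes bloom-filter streams for the named columns, which the FileDump utility mentioned below can then verify.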
[GitHub] spark pull request: SPARK-12898. Consider having dummyCallSite for...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/10825#issuecomment-173106715 Thanks for review. I have added a comment in the code for the same.
[GitHub] spark pull request: SPARK-12898. Consider having dummyCallSite for...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/10825 SPARK-12898. Consider having dummyCallSite for HiveTableScan Currently, HiveTableScan runs with getCallSite, which is really expensive and shows up when scanning through a large partitioned table (e.g., TPC-DS), slowing down the overall runtime of the job. It would be good to consider having dummyCallSite in HiveTableScan. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-12898 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10825.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10825 commit 3a32561eb905b236014cad74472c3a8c359b1aa0 Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-01-19T04:27:52Z SPARK-12898. Consider having dummyCallSite for HiveTableScan
[GitHub] spark pull request: SPARK-12417. [SQL] Orc bloom filter options ar...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/10375#issuecomment-166486869 Thanks @zhzhan. Enabled ORC PPD by default and also added a test case for bloom filters in the latest commit. ORC RecordReaderImpl is not public in the version of Hive that is supported in Spark; hence relying on the FileDump utility from ORC to test bloom filters.
[GitHub] spark pull request: SPARK-12417. [SQL] Orc bloom filter options ar...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/10375 SPARK-12417. [SQL] Orc bloom filter options are not propagated during file … You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10375.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10375 commit 72436a94720bc73ff617a83337a321586c9a4de2 Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2015-12-18T10:08:05Z SPARK-12417. Orc bloom filter options are not propagated during file write in spark