[jira] [Created] (SPARK-11179) Push filters through aggregate if filters are subset of 'group by' expressions
Nitin Goyal created SPARK-11179:
-----------------------------------

             Summary: Push filters through aggregate if filters are subset of 'group by' expressions
                 Key: SPARK-11179
                 URL: https://issues.apache.org/jira/browse/SPARK-11179
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Nitin Goyal
            Priority: Minor
             Fix For: 1.6.0

Push filters through the aggregate if the filters are a subset of the 'group by' expressions. This optimisation can be added in Spark SQL's Optimizer class.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
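As a sanity check on the rewrite this issue proposes, the equivalence it relies on can be sketched with plain Scala collections (no Catalyst involved): when a predicate references only the grouping keys, applying it before or after the aggregate yields the same result, so the filter can safely be pushed below the aggregate. The names below are illustrative, not Spark code.

```scala
// Sketch (not Catalyst code): if a filter references only grouping keys,
// it can run before the aggregate without changing the result.
case class SalaryRow(dept: String, salary: Int)

def aggregateThenFilter(rows: Seq[SalaryRow], keep: String => Boolean): Map[String, Int] =
  rows.groupBy(_.dept)                                    // GROUP BY dept
    .map { case (d, rs) => d -> rs.map(_.salary).sum }    // SUM(salary)
    .filter { case (d, _) => keep(d) }                    // filter after the aggregate

def filterThenAggregate(rows: Seq[SalaryRow], keep: String => Boolean): Map[String, Int] =
  rows.filter(r => keep(r.dept))                          // filter pushed below the aggregate
    .groupBy(_.dept)
    .map { case (d, rs) => d -> rs.map(_.salary).sum }
```

Pushing the filter down means fewer rows ever reach the (potentially expensive) aggregation, which is the point of the proposed Optimizer rule.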
[jira] [Created] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
Nitin Goyal created SPARK-7970:
-----------------------------------

             Summary: Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
                 Key: SPARK-7970
                 URL: https://issues.apache.org/jira/browse/SPARK-7970
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, SQL
    Affects Versions: 1.2.0, 1.3.0
            Reporter: Nitin Goyal

The closure cleaner slows down the execution of Spark SQL queries fired on a union of RDDs. The time spent at the driver side increases linearly with the number of RDDs unioned. Refer to the following thread for more context:

http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html

As can be seen in the attached JProfiler screenshots, a lot of time is consumed in the getClassReader method of ClosureCleaner and the rest in ensureSerializable (at least in my case).

This can be fixed in two ways (as per my current understanding):

1. Fix at the Spark SQL level - As pointed out by yhuai, we can create MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes the ClosureCleaner clean method.
2. Fix at the Spark core level - (i) Make checkSerializable property-driven in SparkContext's clean method (ii) Somehow cache the class reader for the last 'n' classes
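Option 2(ii), caching the class reader for the last 'n' classes, could look roughly like the following last-n memo with LRU eviction. `LastNCache` and the `compute` stand-in are hypothetical; this is not actual ClosureCleaner code, just a sketch of the caching idea: repeated cleaning of the same closure class across the unioned RDDs would hit the cache instead of re-reading bytecode.

```scala
import scala.collection.mutable

// Hypothetical sketch of option 2(ii): memoize the per-class work the
// closure cleaner repeats for every RDD in the union, keeping only the
// last `n` classes (least-recently-used eviction).
class LastNCache[K, V](n: Int)(compute: K => V) {
  private val cache = mutable.LinkedHashMap.empty[K, V]
  var misses = 0 // counts how often `compute` actually ran

  def get(k: K): V = cache.remove(k) match {
    case Some(v) =>
      cache.put(k, v) // hit: re-insert so the key becomes most-recent
      v
    case None =>
      misses += 1
      if (cache.size >= n) cache.remove(cache.head._1) // evict least-recent
      val v = compute(k)
      cache.put(k, v)
      v
  }
}
```

With such a cache, unioning `m` RDDs over the same closure class would pay the expensive bytecode-reading cost once rather than `m` times.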
[jira] [Updated] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
[ https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nitin Goyal updated SPARK-7970:
-------------------------------

    Attachment: Screen Shot 2015-05-27 at 11.07.02 pm.png

Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
----------------------------------------------------------------------

                 Key: SPARK-7970
                 URL: https://issues.apache.org/jira/browse/SPARK-7970
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, SQL
    Affects Versions: 1.2.0, 1.3.0
            Reporter: Nitin Goyal
         Attachments: Screen Shot 2015-05-27 at 11.07.02 pm.png

The closure cleaner slows down the execution of Spark SQL queries fired on a union of RDDs. The time spent at the driver side increases linearly with the number of RDDs unioned. Refer to the following thread for more context:

http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html

As can be seen in the attached JProfiler screenshots, a lot of time is consumed in the getClassReader method of ClosureCleaner and the rest in ensureSerializable (at least in my case).

This can be fixed in two ways (as per my current understanding):

1. Fix at the Spark SQL level - As pointed out by yhuai, we can create MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes the ClosureCleaner clean method.
2. Fix at the Spark core level - (i) Make checkSerializable property-driven in SparkContext's clean method (ii) Somehow cache the class reader for the last 'n' classes
[jira] [Updated] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
[ https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nitin Goyal updated SPARK-7970:
-------------------------------

    Attachment: Screen Shot 2015-05-27 at 11.01.03 pm.png

Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
----------------------------------------------------------------------

                 Key: SPARK-7970
                 URL: https://issues.apache.org/jira/browse/SPARK-7970
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, SQL
    Affects Versions: 1.2.0, 1.3.0
            Reporter: Nitin Goyal
         Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot 2015-05-27 at 11.07.02 pm.png

The closure cleaner slows down the execution of Spark SQL queries fired on a union of RDDs. The time spent at the driver side increases linearly with the number of RDDs unioned. Refer to the following thread for more context:

http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html

As can be seen in the attached JProfiler screenshots, a lot of time is consumed in the getClassReader method of ClosureCleaner and the rest in ensureSerializable (at least in my case).

This can be fixed in two ways (as per my current understanding):

1. Fix at the Spark SQL level - As pointed out by yhuai, we can create MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes the ClosureCleaner clean method.
2. Fix at the Spark core level - (i) Make checkSerializable property-driven in SparkContext's clean method (ii) Somehow cache the class reader for the last 'n' classes
[jira] [Updated] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
[ https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nitin Goyal updated SPARK-7970:
-------------------------------

    Description:

The closure cleaner slows down the execution of Spark SQL queries fired on a union of RDDs. The time spent at the driver side increases linearly with the number of RDDs unioned. Refer to the following thread for more context:

http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html

As can be seen in the attached JProfiler screenshots, a lot of time is consumed in the getClassReader method of ClosureCleaner and the rest in ensureSerializable (at least in my case).

This can be fixed in two ways (as per my current understanding):

1. Fix at the Spark SQL level - As pointed out by yhuai, we can create MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes the ClosureCleaner clean method (see PR - https://github.com/apache/spark/pull/6256).
2. Fix at the Spark core level - (i) Make checkSerializable property-driven in SparkContext's clean method (ii) Somehow cache the class reader for the last 'n' classes

    was:

The closure cleaner slows down the execution of Spark SQL queries fired on a union of RDDs. The time spent at the driver side increases linearly with the number of RDDs unioned. Refer to the following thread for more context:

http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html

As can be seen in the attached JProfiler screenshots, a lot of time is consumed in the getClassReader method of ClosureCleaner and the rest in ensureSerializable (at least in my case).

This can be fixed in two ways (as per my current understanding):

1. Fix at the Spark SQL level - As pointed out by yhuai, we can create MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes the ClosureCleaner clean method.
2. Fix at the Spark core level - (i) Make checkSerializable property-driven in SparkContext's clean method (ii) Somehow cache the class reader for the last 'n' classes

Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
----------------------------------------------------------------------

                 Key: SPARK-7970
                 URL: https://issues.apache.org/jira/browse/SPARK-7970
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, SQL
    Affects Versions: 1.2.0, 1.3.0
            Reporter: Nitin Goyal
         Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot 2015-05-27 at 11.07.02 pm.png

The closure cleaner slows down the execution of Spark SQL queries fired on a union of RDDs. The time spent at the driver side increases linearly with the number of RDDs unioned. Refer to the following thread for more context:

http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html

As can be seen in the attached JProfiler screenshots, a lot of time is consumed in the getClassReader method of ClosureCleaner and the rest in ensureSerializable (at least in my case).

This can be fixed in two ways (as per my current understanding):

1. Fix at the Spark SQL level - As pointed out by yhuai, we can create MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes the ClosureCleaner clean method (see PR - https://github.com/apache/spark/pull/6256).
2. Fix at the Spark core level - (i) Make checkSerializable property-driven in SparkContext's clean method (ii) Somehow cache the class reader for the last 'n' classes
[jira] [Updated] (SPARK-7331) Create HiveConf per application instead of per query in HiveQl.scala
[ https://issues.apache.org/jira/browse/SPARK-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nitin Goyal updated SPARK-7331:
-------------------------------

    Description:

A new HiveConf is created per query in the getAst method in HiveQl.scala:

  def getAst(sql: String): ASTNode = {
    /*
     * Context has to be passed in hive0.13.1.
     * Otherwise, there will be Null pointer exception,
     * when retrieving properties from HiveConf.
     */
    val hContext = new Context(new HiveConf())
    val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
    hContext.clear()
    node
  }

Creating the HiveConf adds a minimum of 90 ms delay per query, so we move its creation into the object such that it gets created once.

    was:

A new HiveConf is created per query in the getAst method in HiveQl.scala:

  def getAst(sql: String): ASTNode = {
    /*
     * Context has to be passed in hive0.13.1.
     * Otherwise, there will be Null pointer exception,
     * when retrieving properties from HiveConf.
     */
    val hContext = new Context(new HiveConf())
    val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
    hContext.clear()
    node
  }

Creating the HiveConf adds a minimum of 90 ms delay per query, so we move its creation into the object such that it gets initialised once.

Create HiveConf per application instead of per query in HiveQl.scala
--------------------------------------------------------------------

                 Key: SPARK-7331
                 URL: https://issues.apache.org/jira/browse/SPARK-7331
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.2.0, 1.3.0
            Reporter: Nitin Goyal
            Priority: Minor

A new HiveConf is created per query in the getAst method in HiveQl.scala:

  def getAst(sql: String): ASTNode = {
    /*
     * Context has to be passed in hive0.13.1.
     * Otherwise, there will be Null pointer exception,
     * when retrieving properties from HiveConf.
     */
    val hContext = new Context(new HiveConf())
    val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
    hContext.clear()
    node
  }

Creating the HiveConf adds a minimum of 90 ms delay per query, so we move its creation into the object such that it gets created once.
[jira] [Updated] (SPARK-7331) Create HiveConf per application instead of per query in HiveQl.scala
[ https://issues.apache.org/jira/browse/SPARK-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nitin Goyal updated SPARK-7331:
-------------------------------

    Description:

A new HiveConf is created per query in the getAst method in HiveQl.scala:

  def getAst(sql: String): ASTNode = {
    /*
     * Context has to be passed in hive0.13.1.
     * Otherwise, there will be Null pointer exception,
     * when retrieving properties from HiveConf.
     */
    val hContext = new Context(new HiveConf())
    val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
    hContext.clear()
    node
  }

Creating the HiveConf adds a minimum of 90 ms delay per query, so we move its creation into the object such that it gets initialised once.

    was:

A new HiveConf is created per query in the getAst method in HiveQl.scala:

  def getAst(sql: String): ASTNode = {
    /*
     * Context has to be passed in hive0.13.1.
     * Otherwise, there will be Null pointer exception,
     * when retrieving properties from HiveConf.
     */
    val hContext = new Context(new HiveConf())
    val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
    hContext.clear()
    node
  }

Creating the HiveConf adds a minimum of 90 ms delay per query, so we move its creation into the object such that it gets initialised once.

Create HiveConf per application instead of per query in HiveQl.scala
--------------------------------------------------------------------

                 Key: SPARK-7331
                 URL: https://issues.apache.org/jira/browse/SPARK-7331
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.2.0, 1.3.0
            Reporter: Nitin Goyal
            Priority: Minor

A new HiveConf is created per query in the getAst method in HiveQl.scala:

  def getAst(sql: String): ASTNode = {
    /*
     * Context has to be passed in hive0.13.1.
     * Otherwise, there will be Null pointer exception,
     * when retrieving properties from HiveConf.
     */
    val hContext = new Context(new HiveConf())
    val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
    hContext.clear()
    node
  }

Creating the HiveConf adds a minimum of 90 ms delay per query, so we move its creation into the object such that it gets initialised once.
[jira] [Created] (SPARK-7331) Create HiveConf per application instead of per query in HiveQl.scala
Nitin Goyal created SPARK-7331:
-----------------------------------

             Summary: Create HiveConf per application instead of per query in HiveQl.scala
                 Key: SPARK-7331
                 URL: https://issues.apache.org/jira/browse/SPARK-7331
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.2.0, 1.3.0
            Reporter: Nitin Goyal
            Priority: Minor

A new HiveConf is created per query in the getAst method in HiveQl.scala:

  def getAst(sql: String): ASTNode = {
    /*
     * Context has to be passed in hive0.13.1.
     * Otherwise, there will be Null pointer exception,
     * when retrieving properties from HiveConf.
     */
    val hContext = new Context(new HiveConf())
    val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
    hContext.clear()
    node
  }

Creating the HiveConf adds a minimum of 90 ms delay per query, so we move its creation into the object such that it gets initialised once.
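The proposed fix amounts to replacing a per-call `new HiveConf()` with a lazily initialised, application-wide instance. A minimal stand-in of the pattern, with a hypothetical `ConfLike` in place of the real HiveConf (which needs a Hive classpath to construct):

```scala
// Minimal stand-in for the proposed fix: hoist an expensive per-query
// construction into a lazily initialised, application-wide singleton.
// `ConfLike` is a hypothetical placeholder for HiveConf.
class ConfLike {
  ConfLike.constructions += 1 // stands in for the ~90 ms HiveConf constructor
}

object ConfLike {
  var constructions = 0          // how many times the expensive ctor ran
  lazy val shared = new ConfLike // built once, on first use
}

// Stand-in for getAst: reuses the shared instance instead of building
// a fresh one on every query.
def getAstLike(sql: String): String = {
  val conf = ConfLike.shared
  s"parsed($sql)" // placeholder for the real parse result
}
```

A `lazy val` in a Scala `object` is initialised exactly once (thread-safely), so every query after the first skips the construction cost entirely.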
[jira] [Created] (SPARK-5880) Change log level of batch pruning string in InMemoryColumnarTableScan from Info to Debug
Nitin Goyal created SPARK-5880:
-----------------------------------

             Summary: Change log level of batch pruning string in InMemoryColumnarTableScan from Info to Debug
                 Key: SPARK-5880
                 URL: https://issues.apache.org/jira/browse/SPARK-5880
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.2.1, 1.3.0
            Reporter: Nitin Goyal
            Priority: Trivial
             Fix For: 1.3.0

In InMemoryColumnarTableScan, we build a string of the statistics of all the columns and log it at INFO level whenever batch pruning happens. This causes a performance hit when there are a large number of batches, a good number of columns, and almost every batch gets pruned. We can make the string evaluate lazily and change the log level to DEBUG.
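The "evaluate lazily" part can be done with a by-name parameter, so the statistics string is only built when the level is actually enabled. An illustrative sketch (`MiniLogger` is hypothetical, not Spark's Logging trait, though the trait uses the same by-name trick):

```scala
// Illustrative sketch: a by-name (=> String) parameter defers building the
// expensive statistics string until the log level says it is needed.
class MiniLogger(debugEnabled: Boolean) {
  def logDebug(msg: => String): Option[String] =
    if (debugEnabled) Some(msg) else None // `msg` is only evaluated here
}
```

With DEBUG disabled, the per-batch statistics string is never constructed, so pruning thousands of batches no longer pays the string-building cost.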
[jira] [Created] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching
Nitin Goyal created SPARK-4849:
-----------------------------------

             Summary: Pass partitioning information (distribute by) to In-memory caching
                 Key: SPARK-4849
                 URL: https://issues.apache.org/jira/browse/SPARK-4849
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.2.0
            Reporter: Nitin Goyal
            Priority: Minor
[jira] [Updated] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching
[ https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nitin Goyal updated SPARK-4849:
-------------------------------

    Description:

HQL 'distribute by column_name' partitions data based on the specified column's values. We can pass this information to in-memory caching for further performance improvements. E.g., in joins, an extra partition step can be saved based on this information. Refer - http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html

Pass partitioning information (distribute by) to In-memory caching
------------------------------------------------------------------

                 Key: SPARK-4849
                 URL: https://issues.apache.org/jira/browse/SPARK-4849
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.2.0
            Reporter: Nitin Goyal
            Priority: Minor

HQL 'distribute by column_name' partitions data based on the specified column's values. We can pass this information to in-memory caching for further performance improvements. E.g., in joins, an extra partition step can be saved based on this information. Refer - http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html
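The saving the issue describes can be sketched with plain collections: if both join inputs are already hash-partitioned on the join key (which 'distribute by' guarantees), each partition pair can be joined locally, with no extra repartition/shuffle step. The helpers below are illustrative, not Spark code.

```scala
// Sketch: with both sides hash-partitioned on the join key, partition i of
// the left side can only match partition i of the right side, so the join
// proceeds partition-by-partition with no repartition step.
def partitionBy[K, V](data: Seq[(K, V)], numParts: Int): Vector[Seq[(K, V)]] =
  Vector.tabulate(numParts) { p =>
    data.filter { case (k, _) => math.abs(k.hashCode) % numParts == p }
  }

def coPartitionedJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)],
                               numParts: Int): Seq[(K, (A, B))] = {
  val lp = partitionBy(left, numParts)
  val rp = partitionBy(right, numParts)
  lp.zip(rp).flatMap { case (l, r) =>          // join each partition pair locally
    for ((k, a) <- l; (k2, b) <- r if k == k2) yield (k, (a, b))
  }
}
```

If the cached in-memory relation recorded that it was built from a 'distribute by' on the join column, a planner could pick this co-partitioned path and skip the exchange it would otherwise insert.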