[jira] [Created] (SPARK-11179) Push filters through aggregate if filters are subset of 'group by' expressions

2015-10-19 Thread Nitin Goyal (JIRA)
Nitin Goyal created SPARK-11179:
---

 Summary: Push filters through aggregate if filters are subset of 
'group by' expressions
 Key: SPARK-11179
 URL: https://issues.apache.org/jira/browse/SPARK-11179
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Nitin Goyal
Priority: Minor
 Fix For: 1.6.0


Push filters through the aggregate if the filters are a subset of the 'group by' 
expressions. This optimisation can be added to Spark SQL's Optimizer class.
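The equivalence behind this optimisation can be shown outside Spark with a toy 
grouping in plain Python (an illustrative sketch only, not Spark's Optimizer code): 
a filter that touches only grouping columns gives the same result whether it runs 
above or below the aggregate, and pushing it below lets the aggregate see fewer rows.

```python
from collections import defaultdict

def aggregate(rows, key, agg_col):
    """Group rows by `key` and sum `agg_col` (stand-in for an Aggregate node)."""
    groups = defaultdict(int)
    for row in rows:
        groups[row[key]] += row[agg_col]
    return [{key: k, "sum": v} for k, v in groups.items()]

rows = [{"dept": "a", "salary": 10}, {"dept": "b", "salary": 20},
        {"dept": "a", "salary": 30}]

# Filter AFTER the aggregate (predicate on the grouping column 'dept')...
after = [r for r in aggregate(rows, "dept", "salary") if r["dept"] == "a"]

# ...is equivalent to pushing the filter BELOW the aggregate:
before = aggregate([r for r in rows if r["dept"] == "a"], "dept", "salary")

assert after == before  # safe because the predicate only touches group keys
```

The rewrite is only safe when every filter column is among the grouping 
expressions; a predicate on an aggregated value (e.g. on the sum) cannot be pushed.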



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)

2015-05-30 Thread Nitin Goyal (JIRA)
Nitin Goyal created SPARK-7970:
--

 Summary: Optimize code for SQL queries fired on Union of RDDs 
(closure cleaner)
 Key: SPARK-7970
 URL: https://issues.apache.org/jira/browse/SPARK-7970
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 1.3.0, 1.2.0
Reporter: Nitin Goyal


Closure cleaner slows down the execution of Spark SQL queries fired on a union of 
RDDs. The time spent at the driver side increases linearly with the number of RDDs 
unioned. Refer to the following thread for more context:

http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html

As can be seen in the attached JProfiler screenshots, a lot of time is consumed 
in the getClassReader method of ClosureCleaner and the rest in 
ensureSerializable (at least in my case).

This can be fixed in two ways (as per my current understanding):

1. Fix at the Spark SQL level - As pointed out by yhuai, we can create a 
MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
ClosureCleaner's clean method.

2. Fix at the Spark core level -
  (i) Make checkSerializable property-driven in SparkContext's clean method
  (ii) Cache the class reader for the last 'n' classes
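Option 2(ii) is essentially an LRU cache in front of the class-reader lookup. A 
minimal sketch in Python (the function names and the byte-reading stand-in are 
hypothetical, not ClosureCleaner's actual API):

```python
from functools import lru_cache

# Hypothetical stand-in for re-reading class bytes on every call, which is
# what makes repeated getClassReader calls expensive.
def read_class_bytes(class_name):
    return ("bytecode-for-" + class_name).encode()

@lru_cache(maxsize=64)  # "cache the class reader for the last 'n' classes", n = 64
def get_class_reader(class_name):
    return read_class_bytes(class_name)

# Repeated lookups for the same generated closure class now hit the cache
# instead of re-reading the class file each time.
for _ in range(1000):
    get_class_reader("org.example.Closure$1")

info = get_class_reader.cache_info()  # 1 miss, 999 hits
```

In the Spark codebase this would need eviction tuned to the number of distinct 
closure classes per job, since an undersized cache degenerates to the current behaviour.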






[jira] [Updated] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)

2015-05-30 Thread Nitin Goyal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nitin Goyal updated SPARK-7970:
---
Attachment: Screen Shot 2015-05-27 at 11.07.02 pm.png

 Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
 --

 Key: SPARK-7970
 URL: https://issues.apache.org/jira/browse/SPARK-7970
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Nitin Goyal
 Attachments: Screen Shot 2015-05-27 at 11.07.02 pm.png


 Closure cleaner slows down the execution of Spark SQL queries fired on a union 
 of RDDs. The time spent at the driver side increases linearly with the number 
 of RDDs unioned. Refer to the following thread for more context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html
 As can be seen in the attached JProfiler screenshots, a lot of time is consumed 
 in the getClassReader method of ClosureCleaner and the rest in 
 ensureSerializable (at least in my case).
 This can be fixed in two ways (as per my current understanding):
 1. Fix at the Spark SQL level - As pointed out by yhuai, we can create a 
 MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
 ClosureCleaner's clean method.
 2. Fix at the Spark core level -
   (i) Make checkSerializable property-driven in SparkContext's clean method
   (ii) Cache the class reader for the last 'n' classes






[jira] [Updated] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)

2015-05-30 Thread Nitin Goyal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nitin Goyal updated SPARK-7970:
---
Attachment: Screen Shot 2015-05-27 at 11.01.03 pm.png

 Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
 --

 Key: SPARK-7970
 URL: https://issues.apache.org/jira/browse/SPARK-7970
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Nitin Goyal
 Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot 
 2015-05-27 at 11.07.02 pm.png


 Closure cleaner slows down the execution of Spark SQL queries fired on a union 
 of RDDs. The time spent at the driver side increases linearly with the number 
 of RDDs unioned. Refer to the following thread for more context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html
 As can be seen in the attached JProfiler screenshots, a lot of time is consumed 
 in the getClassReader method of ClosureCleaner and the rest in 
 ensureSerializable (at least in my case).
 This can be fixed in two ways (as per my current understanding):
 1. Fix at the Spark SQL level - As pointed out by yhuai, we can create a 
 MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
 ClosureCleaner's clean method.
 2. Fix at the Spark core level -
   (i) Make checkSerializable property-driven in SparkContext's clean method
   (ii) Cache the class reader for the last 'n' classes






[jira] [Updated] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)

2015-05-30 Thread Nitin Goyal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nitin Goyal updated SPARK-7970:
---
Description: 
Closure cleaner slows down the execution of Spark SQL queries fired on a union of 
RDDs. The time spent at the driver side increases linearly with the number of RDDs 
unioned. Refer to the following thread for more context:

http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html

As can be seen in the attached JProfiler screenshots, a lot of time is consumed 
in the getClassReader method of ClosureCleaner and the rest in 
ensureSerializable (at least in my case).

This can be fixed in two ways (as per my current understanding):

1. Fix at the Spark SQL level - As pointed out by yhuai, we can create a 
MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
ClosureCleaner's clean method (see PR - 
https://github.com/apache/spark/pull/6256).

2. Fix at the Spark core level -
  (i) Make checkSerializable property-driven in SparkContext's clean method
  (ii) Cache the class reader for the last 'n' classes

  was:
Closure cleaner slows down the execution of Spark SQL queries fired on a union of 
RDDs. The time spent at the driver side increases linearly with the number of RDDs 
unioned. Refer to the following thread for more context:

http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html

As can be seen in the attached JProfiler screenshots, a lot of time is consumed 
in the getClassReader method of ClosureCleaner and the rest in 
ensureSerializable (at least in my case).

This can be fixed in two ways (as per my current understanding):

1. Fix at the Spark SQL level - As pointed out by yhuai, we can create a 
MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
ClosureCleaner's clean method.

2. Fix at the Spark core level -
  (i) Make checkSerializable property-driven in SparkContext's clean method
  (ii) Cache the class reader for the last 'n' classes


 Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
 --

 Key: SPARK-7970
 URL: https://issues.apache.org/jira/browse/SPARK-7970
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Nitin Goyal
 Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot 
 2015-05-27 at 11.07.02 pm.png


 Closure cleaner slows down the execution of Spark SQL queries fired on a union 
 of RDDs. The time spent at the driver side increases linearly with the number 
 of RDDs unioned. Refer to the following thread for more context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html
 As can be seen in the attached JProfiler screenshots, a lot of time is consumed 
 in the getClassReader method of ClosureCleaner and the rest in 
 ensureSerializable (at least in my case).
 This can be fixed in two ways (as per my current understanding):
 1. Fix at the Spark SQL level - As pointed out by yhuai, we can create a 
 MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
 ClosureCleaner's clean method (see PR - 
 https://github.com/apache/spark/pull/6256).
 2. Fix at the Spark core level -
   (i) Make checkSerializable property-driven in SparkContext's clean method
   (ii) Cache the class reader for the last 'n' classes






[jira] [Updated] (SPARK-7331) Create HiveConf per application instead of per query in HiveQl.scala

2015-05-03 Thread Nitin Goyal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nitin Goyal updated SPARK-7331:
---
Description: 
A new HiveConf is created per query in the getAst method in HiveQl.scala:

  def getAst(sql: String): ASTNode = {
    /*
     * Context has to be passed in hive 0.13.1.
     * Otherwise, there will be a NullPointerException
     * when retrieving properties from HiveConf.
     */
    val hContext = new Context(new HiveConf())
    val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
    hContext.clear()
    node
  }

Creating a HiveConf adds a minimum of 90 ms delay per query, so we move its 
creation into an object so that it gets created once.

  was:
A new HiveConf is created per query in the getAst method in HiveQl.scala:

  def getAst(sql: String): ASTNode = {
    /*
     * Context has to be passed in hive 0.13.1.
     * Otherwise, there will be a NullPointerException
     * when retrieving properties from HiveConf.
     */
    val hContext = new Context(new HiveConf())
    val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
    hContext.clear()
    node
  }

Creating a HiveConf adds a minimum of 90 ms delay per query, so we move its 
creation into an object so that it gets initialised once.


 Create HiveConf per application instead of per query in HiveQl.scala
 

 Key: SPARK-7331
 URL: https://issues.apache.org/jira/browse/SPARK-7331
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Nitin Goyal
Priority: Minor

 A new HiveConf is created per query in the getAst method in HiveQl.scala:
   def getAst(sql: String): ASTNode = {
     /*
      * Context has to be passed in hive 0.13.1.
      * Otherwise, there will be a NullPointerException
      * when retrieving properties from HiveConf.
      */
     val hContext = new Context(new HiveConf())
     val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
     hContext.clear()
     node
   }
 Creating a HiveConf adds a minimum of 90 ms delay per query, so we move its 
 creation into an object so that it gets created once.






[jira] [Updated] (SPARK-7331) Create HiveConf per application instead of per query in HiveQl.scala

2015-05-03 Thread Nitin Goyal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nitin Goyal updated SPARK-7331:
---
Description: 
A new HiveConf is created per query in the getAst method in HiveQl.scala:

  def getAst(sql: String): ASTNode = {
    /*
     * Context has to be passed in hive 0.13.1.
     * Otherwise, there will be a NullPointerException
     * when retrieving properties from HiveConf.
     */
    val hContext = new Context(new HiveConf())
    val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
    hContext.clear()
    node
  }

Creating a HiveConf adds a minimum of 90 ms delay per query, so we move its 
creation into an object so that it gets initialised once.

  was:
A new HiveConf is created per query in the getAst method in HiveQl.scala:

  def getAst(sql: String): ASTNode = {
    /*
     * Context has to be passed in hive 0.13.1.
     * Otherwise, there will be a NullPointerException
     * when retrieving properties from HiveConf.
     */
    val hContext = new Context(new HiveConf())
    val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
    hContext.clear()
    node
  }

Creating a HiveConf adds a minimum of 90 ms delay per query, so we move its 
creation into an object so that it gets initialised once.


 Create HiveConf per application instead of per query in HiveQl.scala
 

 Key: SPARK-7331
 URL: https://issues.apache.org/jira/browse/SPARK-7331
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.3.0
Reporter: Nitin Goyal
Priority: Minor

 A new HiveConf is created per query in the getAst method in HiveQl.scala:
   def getAst(sql: String): ASTNode = {
     /*
      * Context has to be passed in hive 0.13.1.
      * Otherwise, there will be a NullPointerException
      * when retrieving properties from HiveConf.
      */
     val hContext = new Context(new HiveConf())
     val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
     hContext.clear()
     node
   }
 Creating a HiveConf adds a minimum of 90 ms delay per query, so we move its 
 creation into an object so that it gets initialised once.






[jira] [Created] (SPARK-7331) Create HiveConf per application instead of per query in HiveQl.scala

2015-05-03 Thread Nitin Goyal (JIRA)
Nitin Goyal created SPARK-7331:
--

 Summary: Create HiveConf per application instead of per query in 
HiveQl.scala
 Key: SPARK-7331
 URL: https://issues.apache.org/jira/browse/SPARK-7331
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.2.0
Reporter: Nitin Goyal
Priority: Minor


A new HiveConf is created per query in the getAst method in HiveQl.scala:

  def getAst(sql: String): ASTNode = {
    /*
     * Context has to be passed in hive 0.13.1.
     * Otherwise, there will be a NullPointerException
     * when retrieving properties from HiveConf.
     */
    val hContext = new Context(new HiveConf())
    val node = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql, hContext))
    hContext.clear()
    node
  }

Creating a HiveConf adds a minimum of 90 ms delay per query, so we move its 
creation into an object so that it gets initialised once.






[jira] [Created] (SPARK-5880) Change log level of batch pruning string in InMemoryColumnarTableScan from Info to Debug

2015-02-17 Thread Nitin Goyal (JIRA)
Nitin Goyal created SPARK-5880:
--

 Summary: Change log level of batch pruning string in 
InMemoryColumnarTableScan from Info to Debug
 Key: SPARK-5880
 URL: https://issues.apache.org/jira/browse/SPARK-5880
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.1, 1.3.0
Reporter: Nitin Goyal
Priority: Trivial
 Fix For: 1.3.0


In InMemoryColumnarTableScan, we build a string of the statistics of all the 
columns and log it at INFO level whenever batch pruning happens. This causes a 
performance hit when there are a large number of batches and a good number of 
columns and almost every batch gets pruned.

We can make the string evaluate lazily and change the log level to DEBUG.






[jira] [Created] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching

2014-12-15 Thread Nitin Goyal (JIRA)
Nitin Goyal created SPARK-4849:
--

 Summary: Pass partitioning information (distribute by) to 
In-memory caching
 Key: SPARK-4849
 URL: https://issues.apache.org/jira/browse/SPARK-4849
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Nitin Goyal
Priority: Minor









[jira] [Updated] (SPARK-4849) Pass partitioning information (distribute by) to In-memory caching

2014-12-15 Thread Nitin Goyal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nitin Goyal updated SPARK-4849:
---
Description: 
HQL's 'distribute by column_name' partitions data based on the specified 
column's values. We can pass this information to in-memory caching for further 
performance improvements, e.g. in joins, an extra partitioning step can be saved 
based on this information.

Refer - 
http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html
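The join saving can be illustrated with a toy hash-partitioning sketch in plain 
Python (illustrative only, not Spark's partitioner): when both sides are known 
to be partitioned by the join key, matching keys land in the same partition 
index, so the join can run partition-by-partition with no extra repartition step.

```python
def hash_partition(rows, key, n):
    """Partition rows by hash of the join key (the 'distribute by' column)."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

N = 4
left = hash_partition([{"id": i, "l": i * 10} for i in range(8)], "id", N)
right = hash_partition([{"id": i, "r": i * 100} for i in range(8)], "id", N)

# Co-partitioned on "id": join each partition with its counterpart directly.
joined = []
for lp, rp in zip(left, right):
    index = {r["id"]: r for r in rp}
    for row in lp:
        if row["id"] in index:
            joined.append({**row, "r": index[row["id"]]["r"]})
```

If the cache discarded the 'distribute by' partitioning, the planner would have 
to insert that hash-partition step again before the join; remembering it is the 
proposed saving.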

 Pass partitioning information (distribute by) to In-memory caching
 --

 Key: SPARK-4849
 URL: https://issues.apache.org/jira/browse/SPARK-4849
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Nitin Goyal
Priority: Minor

 HQL's 'distribute by column_name' partitions data based on the specified 
 column's values. We can pass this information to in-memory caching for further 
 performance improvements, e.g. in joins, an extra partitioning step can be 
 saved based on this information.
 Refer - 
 http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-td20350.html


