[GitHub] spark pull request #19184: [SPARK-21971][CORE] Too many open files in Spark ...
Github user rajeshbalamohan closed the pull request at: https://github.com/apache/spark/pull/19184 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/19184 Thanks @mridulm , @jerryshao , @viirya . Closing this PR.
[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/19184 Thanks @viirya . I have updated the patch to address your comments. This fixes the "too many open files" issue for queries involving window functions (e.g. Q67, Q72, Q14), but the issue still needs to be addressed for the spill merger. Agreed that this would be a partial patch.
[GitHub] spark pull request #19184: [SPARK-21971][CORE] Too many open files in Spark ...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/19184#discussion_r137973976
--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
@@ -104,6 +124,10 @@ public void loadNext() throws IOException {
     if (taskContext != null) {
       taskContext.killTaskIfInterrupted();
     }
+    if (this.din == null) {
+      // Good time to init (if all files are opened, we can get Too Many files exception)
+      initStreams();
+    }
--- End diff --
Good point. The PR has been tried with queries involving window functions (e.g. Q67), for which it worked fine. During spill merges (esp. getSortedIterator), it is still possible to encounter the too many open files issue.
[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/19184 I got into this with a limit of 32K. "unlimited" is another option which can serve as a workaround, but that may not be preferable in production systems. E.g., with Q67 I observed 9000+ spill files in a task, and with multiple tasks per executor it easily ended up reaching the limit.
[GitHub] spark pull request #19184: [SPARK-21971][CORE] Too many open files in Spark ...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/19184 [SPARK-21971][CORE] Too many open files in Spark due to concurrent files being opened
## What changes were proposed in this pull request?
In UnsafeExternalSorter::getIterator(), a file is opened in UnsafeSorterSpillReader for every spillWriter, and these files are closed at a later point in time as part of the close() call. However, when a large number of spill files are present, the number of open files grows to a great extent and ends up throwing a "Too many open files" exception. This can easily be reproduced with TPC-DS Q67 at 1 TB scale on a multi-node cluster with multiple cores per executor. There are ways to reduce the number of spill files generated in Q67, e.g. increasing "spark.sql.windowExec.buffer.spill.threshold" (4096 is the default). Another option is to increase the ulimit to much higher values. But those are workarounds. This PR reduces the number of files that are kept open at a time in UnsafeSorterSpillReader.
## How was this patch tested?
Manual testing of Q67 at 1 TB and 10 TB scale on a multi-node cluster.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-21971 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19184.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19184 commit dcc2960d5f60add9bfd9446df59b0d0d06365947 Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2017-09-11T01:36:12Z [SPARK-21971][CORE] Too many open files in Spark due to concurrent files being opened
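The lazy-open idea behind this PR can be sketched as follows. This is a hypothetical illustration, not the actual UnsafeSorterSpillReader code: the reader records the spill file path eagerly, but only opens the underlying stream on the first read, so the number of simultaneously open spill files stays small.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of lazily opening a spill file: constructing the reader
// consumes no file descriptor; the stream is opened on first use.
class LazySpillReader {
    private final String file;   // spill file path, recorded eagerly (no fd used)
    private DataInputStream din; // opened lazily on first read

    LazySpillReader(String file) {
        this.file = file;
    }

    private void initStreams() {
        try {
            din = new DataInputStream(new BufferedInputStream(new FileInputStream(file)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int loadNext() {
        if (din == null) {
            // Opening all spill files eagerly is what can trip the
            // "Too many open files" ulimit with thousands of spills per task.
            initStreams();
        }
        try {
            return din.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void close() {
        if (din != null) {
            try {
                din.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
```

As the thread notes, deferring the open only helps when readers are consumed one at a time (e.g. window functions); a k-way spill merge still needs k streams open simultaneously, which is why this remained a partial fix.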
[GitHub] spark issue #14537: [SPARK-16948][SQL] Use metastore schema instead of infer...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 @cloud-fan . The failure is related to the parquet changes introduced for returning metastoreSchema (it has issues with complex types). I am not very comfortable with the Parquet codepath. For the time being, I will revert the last change. We can create a subsequent JIRA if needed for Parquet-related changes; alternatively, I am fine with someone who is comfortable with the Parquet code taking this over. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark pull request #14537: [SPARK-16948][SQL] Use metastore schema instead o...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/14537#discussion_r79972251
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -237,21 +237,27 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
       new Path(metastoreRelation.catalogTable.storage.locationUri.get), partitionSpec)
-    val inferredSchema = if (fileType.equals("parquet")) {
-      val inferredSchema =
-        defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles())
-      inferredSchema.map { inferred =>
-        ParquetFileFormat.mergeMetastoreParquetSchema(metastoreSchema, inferred)
-      }.getOrElse(metastoreSchema)
-    } else {
-      defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles()).get
+    val schema = fileType match {
+      case "parquet" =>
+        val inferredSchema =
+          defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles())
+
+        // For Parquet, get correct schema by merging Metastore schema data types
--- End diff --
Sure. Will change to return metastoreSchema for Parquet as well.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Use metastore schema instead of infer...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 @cloud-fan >> For branch 2.0, we should open another PR to fix the OrcFileFormat.inferSchema, to not throw FileNotFoundException for empty table. The code for not throwing FileNotFoundException in OrcFileFormat.inferSchema was removed from this patch. I can create a separate JIRA for that; please let me know if that is blocking this patch.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Use metastore schema instead of infer...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 Sorry about the delay. Updated the PR.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 Thanks @gatorsmile . Removed the changes related to OrcFileFormat
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 Fixed the test case name. I haven't changed the parquet code path, as I wasn't sure whether it would break any backward compatibility.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 Thanks @gatorsmile, it would be good to retain the change in OrcFileFormat's inferSchema (just in case it is referenced later).
[GitHub] spark pull request #14537: [SPARK-16948][SQL] Querying empty partitioned orc...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/14537#discussion_r76179877
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -54,10 +57,12 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
     sparkSession: SparkSession,
     options: Map[String, String],
     files: Seq[FileStatus]): Option[StructType] = {
-    OrcFileOperator.readSchema(
-      files.map(_.getPath.toUri.toString),
-      Some(sparkSession.sessionState.newHadoopConf())
-    )
+    // Safe to ignore FileNotFoundException in case no files are found.
+    val schema = Try(OrcFileOperator.readSchema(
--- End diff --
Yes, in case this is referred to at any later time.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 OK, reverted the physical schema changes. In both cases it returns the metastore schema, and mismatches can be handled separately.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 For non-partitioned ORC, HiveMetastoreCatalog currently uses the metastore schema and does not infer the schema, hence it is not an issue there. But the problem of wrong mapping (i.e. the physical column name in the file being different from that in the metastore) still exists. The more I look at it, it would be easier to club the patches into this JIRA itself. If so, HiveMetastoreCatalog could just rely on the metastoreSchema, and OrcFileFormat can later do the mapping if the mappings are different. I will revise the patch to include this scenario and post it.
[GitHub] spark pull request #14537: [SPARK-16948][SQL] Querying empty partitioned orc...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/14537#discussion_r75967137
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -237,21 +237,26 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
       new Path(metastoreRelation.catalogTable.storage.locationUri.get), partitionSpec)
-    val inferredSchema = if (fileType.equals("parquet")) {
-      val inferredSchema =
-        defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles())
-      inferredSchema.map { inferred =>
-        ParquetFileFormat.mergeMetastoreParquetSchema(metastoreSchema, inferred)
-      }.getOrElse(metastoreSchema)
-    } else {
-      defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles()).get
+    val inferredSchema =
+      defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles())
+    val schema = fileType match {
+      case "parquet" =>
+        // For Parquet, get correct schema by merging Metastore schema data types
+        // and Parquet schema field names.
+        inferredSchema.map { schema =>
+          ParquetFileFormat.mergeMetastoreParquetSchema(metastoreSchema, schema)
+        }.getOrElse(metastoreSchema)
+      case "orc" =>
+        inferredSchema.getOrElse(metastoreSchema)
+      case _ =>
+        inferredSchema.get
--- End diff --
Thanks @mallman . Addressed this in the latest commit.
[GitHub] spark pull request #14537: [SPARK-16948][SQL] Querying empty partitioned orc...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/14537#discussion_r75902767
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -237,21 +237,26 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
       new Path(metastoreRelation.catalogTable.storage.locationUri.get), partitionSpec)
-    val inferredSchema = if (fileType.equals("parquet")) {
-      val inferredSchema =
-        defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles())
-      inferredSchema.map { inferred =>
-        ParquetFileFormat.mergeMetastoreParquetSchema(metastoreSchema, inferred)
-      }.getOrElse(metastoreSchema)
-    } else {
-      defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles()).get
+    val inferredSchema =
+      defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles())
+    val schema = fileType match {
+      case "parquet" =>
+        // For Parquet, get correct schema by merging Metastore schema data types
+        // and Parquet schema field names.
+        inferredSchema.map { schema =>
+          ParquetFileFormat.mergeMetastoreParquetSchema(metastoreSchema, schema)
+        }.getOrElse(metastoreSchema)
+      case "orc" =>
+        inferredSchema.getOrElse(metastoreSchema)
+      case _ =>
+        inferredSchema.get
--- End diff --
Not sure if an exception has to be thrown in this case, or whether to just return null?
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 Thanks @gatorsmile. Addressed review comments
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 For latest ORC, if the data was written out by Hive, it would have the same mapping.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 Right, for Parquet this could be part of the initial codebase (from SPARK-1251, I believe) which merges any metastore conflicts with parquet files. But in the case of ORC, this inference is still valid, as the column names stored in the old ORC format could be different from those in the Hive Metastore (e.g. HIVE-4243). There is a separate PR, https://github.com/apache/spark/pull/14471, which tracks the ORC compatibility issue.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 Thanks @rxin . Incorporated review comments.
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 @rxin Can you please review when you find time?
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 Thank you thejas and @mallman
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 @tejasapatil, @mallman - Can you please review when you find time?
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14537 Thanks @mallman . Fixed review comments in latest commit.
[GitHub] spark issue #10846: [SPARK-12920][SQL] Fix high CPU usage in spark thrift se...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/10846 They take longer to clean up. If queries are executed continuously, a major portion of the thrift server's time is wasted in GC. In any case, I have removed the HadoopRDD change in the recent commit; it can be tracked in a separate JIRA.
[GitHub] spark issue #10846: [SPARK-12920][SQL] Fix high CPU usage in spark thrift se...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/10846 SoftRef causes lots of memory pressure on the thrift server. To be precise, when executing a query with a large dataset, it can very soon run at 1200% CPU with all threads carrying out GC activity. That is from the HadoopRDD conf caching: due to softRef, entries survive until the GC threshold is reached and only then get cleared. It does not OOM, but runs at very high CPU due to GC. JobProgress* does not clean up the data fast enough in some cases (e.g. when too many queries are executed continuously), and in such cases the memory pressure on the thrift server increases. Both of them contribute to the high CPU usage. I am afraid that fixing only one of them would still leave the high-CPU issue.
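The SoftReference behavior described above can be illustrated with a minimal sketch (the class name and structure are hypothetical, not Spark's actual conf cache): softly referenced values are reclaimed only when the JVM is close to exhausting the heap, so under sustained load such a cache grows until the collector is already under pressure.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a SoftReference-backed cache like the HadoopRDD
// conf caching discussed in this thread. Values stay reachable until the
// GC is near the heap limit, which is the GC-churn behavior described above.
class SoftRefCache<K, V> {
    private final Map<K, SoftReference<V>> cache = new HashMap<>();

    void put(K key, V value) {
        cache.put(key, new SoftReference<>(value));
    }

    V get(K key) {
        SoftReference<V> ref = cache.get(key);
        // Returns null either when the key was never cached or when the
        // collector has already cleared the soft reference under pressure.
        return ref == null ? null : ref.get();
    }
}
```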
[GitHub] spark pull request #14537: [SPARK-16948][SQL] Querying empty partitioned orc...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/14537 [SPARK-16948][SQL] Querying empty partitioned orc tables throws exception
## What changes were proposed in this pull request?
Querying empty partitioned ORC tables from spark-sql throws an exception with `spark.sql.hive.convertMetastoreOrc=true`. This is due to the fact that inferSchema() ends up throwing `FileNotFoundException` when no files are present in partitioned ORC tables. The patch attempts to fix this by falling back to metastore-based schema information.
## How was this patch tested?
Included unit tests and also tested it on a small-scale cluster.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-16948 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14537.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14537 commit 5721b88c7c816f57ef39374ac9b335d870543628 Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-08-08T08:28:23Z [SPARK-16948][SQL] Querying empty partitioned orc tables throws exception
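The fallback described in this PR can be sketched as follows. Names here are hypothetical illustrations (the real logic lives in HiveMetastoreCatalog / OrcFileFormat, in Scala): try to infer the schema from the data files, and fall back to the metastore schema when the table directory is empty and file-based inference fails with FileNotFoundException.

```java
import java.io.FileNotFoundException;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch of schema resolution with a metastore fallback.
// Schemas are represented as plain strings for illustration.
class SchemaResolver {
    static String resolve(Supplier<String> inferFromFiles, String metastoreSchema) {
        try {
            // Prefer the schema inferred from the files, if any was produced.
            return Optional.ofNullable(inferFromFiles.get()).orElse(metastoreSchema);
        } catch (RuntimeException e) {
            if (e.getCause() instanceof FileNotFoundException) {
                // Empty partitioned table: no files to infer from,
                // so fall back to the metastore-provided schema.
                return metastoreSchema;
            }
            throw e; // any other failure is still surfaced
        }
    }
}
```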
[GitHub] spark issue #10846: [SPARK-12920][SQL] Fix high CPU usage in spark thrift se...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/10846 Rebased to master and changed title.
[GitHub] spark issue #10846: SPARK-12920. [SQL]. Spark thrift server can run at very ...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/10846 Sorry about the delay. Missed this one. I haven't tested this recently, but yes, this would be a problem in master as well. Please let me know if I need to rebase this for master.
[GitHub] spark issue #14471: [SPARK-14387][SQL] Enable Hive-1.x ORC compatibility wit...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14471 Thanks @rxin. Changes: 1. Added a test case. Also added a sample ORC file (392 bytes) from Hive 1.x with format "Type: struct<_col0:int,_col1:string>". Without this PR's change in OrcFileFormat, the same test case would end up throwing "java.lang.IllegalArgumentException: Field "key" does not exist.". 2. Fixed the title of the JIRA and the PR.
[GitHub] spark issue #14471: [SPARK-14387][SQL] Exceptions thrown when querying ORC t...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/14471 Fixed scalastyle issues
[GitHub] spark pull request #12293: [SPARK-14387][SQL] Exceptions thrown when queryin...
Github user rajeshbalamohan closed the pull request at: https://github.com/apache/spark/pull/12293
[GitHub] spark issue #12293: [SPARK-14387][SQL] Exceptions thrown when querying ORC t...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/12293 @yuananf Thanks for trying it out. I have rebased it and created https://github.com/apache/spark/pull/14471. Closing this one. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14471: [SPARK-14387][SQL] Exceptions thrown when queryin...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/14471 [SPARK-14387][SQL] Exceptions thrown when querying ORC tables ## What changes were proposed in this pull request? This PR improves OrcFileFormat to handle the case where the schema stored in the ORC file does not match the schema stored in the metastore. ORC data written by Hive 1.x used virtual column names (HIVE-4243). This is fixed in Hive 2.x, but for data written by Hive 1.x, Spark would throw exceptions. To mitigate this, "spark.sql.hive.convertMetastoreOrc" was disabled via SPARK-15705; however, that incurs a performance penalty, since reads then go via HiveTableScan and HadoopRDD. This PR fixes the underlying issue. Related tickets: SPARK-15705 : Change the default value of spark.sql.hive.convertMetastoreOrc to false. SPARK-15705 : Spark won't read ORC schema from metastore for partitioned tables. SPARK-16628 : OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if metastore schema does not match schema stored in ORC files. ## How was this patch tested? Manual testing by setting "spark.sql.hive.convertMetastoreOrc=true" and querying data stored via Hive 1.x in ORC format. Also ran the SQL-related unit tests.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-14387.2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14471.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14471 commit dc943a445047a21a88ab19566eab672e8921dcc1 Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-08-03T02:21:05Z [SPARK-14387][SQL] Exceptions thrown when querying ORC tables --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
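The core of the mismatch is that Hive 1.x wrote ORC files carrying virtual column names (_col0, _col1, ...) while the metastore holds the real column names. The idea of resolving that mismatch positionally can be sketched as below; this is only an illustrative, hypothetical sketch — the class and method names are not Spark's.

```java
import java.util.Arrays;
import java.util.List;

public class PositionalSchemaMatch {
    // If every name stored in the file is a virtual _colN name (Hive 1.x
    // style), apply the catalog schema by position; otherwise trust the
    // names stored in the file itself.
    static List<String> resolve(List<String> physical, List<String> catalog) {
        boolean allVirtual = physical.stream().allMatch(n -> n.matches("_col\\d+"));
        return allVirtual ? catalog : physical;
    }

    public static void main(String[] args) {
        List<String> physical = Arrays.asList("_col0", "_col1");
        List<String> catalog  = Arrays.asList("key", "value");
        System.out.println(resolve(physical, catalog)); // prints [key, value]
    }
}
```

Without a rule like this, looking up a metastore column such as "key" against the file's `_col0`/`_col1` names is exactly what produces the `Field "key" does not exist` error mentioned above.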
[GitHub] spark issue #13522: [SPARK-14321][SQL] Reduce date format cost and string-to...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/13522 Thank you. I have pushed the fixes in the recent commit. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #12105: [SPARK-14321][SQL] Reduce date format cost and st...
Github user rajeshbalamohan closed the pull request at: https://github.com/apache/spark/pull/12105 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #12105: [SPARK-14321][SQL] Reduce date format cost and string-to...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/12105 The patch went stale against the master branch and got a little messy in my local repo. I have created https://github.com/apache/spark/pull/13522, which is rebased to master. Will close this one after review. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #12105: [SPARK-14321][SQL] Reduce date format cost and st...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/12105#discussion_r65885590 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -391,21 +393,24 @@ abstract class UnixTime extends BinaryExpression with ExpectsInputTypes { case StringType if right.foldable => val sdf = classOf[SimpleDateFormat].getName val fString = if (constFormat == null) null else constFormat.toString -val formatter = ctx.freshName("formatter") if (fString == null) { s""" boolean ${ev.isNull} = true; ${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)}; """ } else { + val formatter = ctx.freshName("formatter") + ctx.addMutableState(sdf, formatter, s"""$formatter = null;""") --- End diff -- yes. Creating the formatter here did not create any issues. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13522: [SPARK-14321][SQL] Reduce date format cost and string-to...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/13522 @cloud-fan - Sorry about the delay. Rebased SPARK-14321 for master. https://github.com/apache/spark/pull/12105 had become stale and got a little messy in my local repo, so I ended up creating this PR. I will close the earlier one after review. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13522: [SPARK-14321][SQL] Reduce date format cost and st...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/13522 [SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date functions ## What changes were proposed in this pull request? Here is the generated code snippet when executing date functions. SimpleDateFormat is fairly expensive and can show up as a bottleneck when processing millions of records; it would be better to instantiate it once.
```
/* 066 */ UTF8String primitive5 = null;
/* 067 */ if (!isNull4) {
/* 068 */   try {
/* 069 */     primitive5 = UTF8String.fromString(new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(
/* 070 */       new java.util.Date(primitive7 * 1000L)));
/* 071 */   } catch (java.lang.Throwable e) {
/* 072 */     isNull4 = true;
/* 073 */   }
/* 074 */ }
```
With the modified code, here is the generated code:
```
/* 010 */ private java.text.SimpleDateFormat sdf2;
/* 011 */ private UnsafeRow result13;
/* 012 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder bufferHolder14;
/* 013 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter15;
/* 014 */ ...
...
/* 065 */ boolean isNull0 = isNull3;
/* 066 */ UTF8String primitive1 = null;
/* 067 */ if (!isNull0) {
/* 068 */   try {
/* 069 */     if (sdf2 == null) {
/* 070 */       sdf2 = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
/* 071 */     }
/* 072 */     primitive1 = UTF8String.fromString(sdf2.format(
/* 073 */       new java.util.Date(primitive4 * 1000L)));
/* 074 */   } catch (java.lang.Throwable e) {
/* 075 */     isNull0 = true;
/* 076 */   }
/* 077 */ }
```
Similarly, Calendar.getInstance was used in DateTimeUtils; it can be lazily initialized. ## How was this patch tested?
org.apache.spark.sql.catalyst.expressions.DateExpressionsSuite,org.apache.spark.sql.catalyst.util.DateTimeUtilsSuite You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-14321-1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13522.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13522 commit 602d4a70ba845df3160a07c2c9afe2d5c3c574c4 Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-06-06T12:54:02Z [SPARK-14321][SQL] Reduce date format cost and string-to-date cost in date functions --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
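The cached-formatter pattern from the generated code above can be sketched as a stand-alone class. This is an illustrative version (the class name is hypothetical, not Spark code), assuming a single-threaded, per-task instance, since SimpleDateFormat is not thread-safe:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class CachedFormatter {
    // Lazily created and then reused, mirroring the sdf2 field that the
    // patched codegen emits instead of a new SimpleDateFormat per row.
    private SimpleDateFormat sdf;

    String format(long seconds) {
        if (sdf == null) {
            sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        }
        return sdf.format(new Date(seconds * 1000L));
    }
}
```

Each generated class holds its own formatter field and is used from a single task, which is presumably why the lack of thread safety in SimpleDateFormat is not a problem here.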
[GitHub] spark pull request: [SPARK-14321][SQL] Reduce date format cost and...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12105#issuecomment-222408035 Sorry about the delay in responding to this. Will try to rebase and post the patch asap. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-12998 [SQL]. Enable OrcRelation when con...
Github user rajeshbalamohan closed the pull request at: https://github.com/apache/spark/pull/10938 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14387][SQL] Exceptions thrown when quer...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12293#issuecomment-216082665 \cc @liancheng , @rxin - Can you please review when you find time? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-14113. Consider marking JobConf closure-...
Github user rajeshbalamohan closed the pull request at: https://github.com/apache/spark/pull/11978 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-14113. Consider marking JobConf closure-...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/11978#issuecomment-214705705 @srowen - With the master code base and the changes that went in (FileSourceStrategy, to be specific), this PR is no longer very relevant for master. It would be more relevant for the 1.6.x line, but I am not sure we need to backport it there. Please let me know; otherwise I will close this PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14521][SQL] StackOverflowError in Kryo ...
Github user rajeshbalamohan closed the pull request at: https://github.com/apache/spark/pull/12514 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14752][SQL] LazilyGenerateOrdering thro...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/12661 [SPARK-14752][SQL] LazilyGenerateOrdering throws NullPointerException ## What changes were proposed in this pull request? LazilyGenerateOrdering throws a NullPointerException when combined with TakeOrderedAndProjectExec, which causes simple queries like "select i_item_id from item order by i_item_id limit 10;" to fail in spark-sql. When deserializing in DirectTaskResult, Kryo walks the nested structure and hits an NPE on generatedOrdering. ## How was this patch tested? Manual testing by running multiple SQL queries in a multi-node cluster. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-14752 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12661.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12661 commit e83c9bc87acc794ca3e9a37c999c05550d425e2b Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-04-25T13:09:58Z [SPARK-14752][SQL] LazilyGenerateOrdering throws NullPointerException with TakeOrderedAndProject --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
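The "lazily generate" pattern this class is named after can be sketched as follows: the generated comparator is held in a transient field, so a deserialized copy rebuilds it on first use instead of dereferencing null. This is only an illustrative sketch under those assumptions, not Spark's actual LazilyGeneratedOrdering:

```java
import java.io.Serializable;
import java.util.Comparator;

public class LazyOrdering implements Serializable {
    // Not serialized: after deserialization this field is null again.
    private transient Comparator<Integer> generated;

    Comparator<Integer> get() {
        if (generated == null) {
            // Stand-in for the code-generated ordering being rebuilt lazily.
            generated = Integer::compare;
        }
        return generated;
    }

    int compare(int a, int b) {
        // Always goes through get(), so a freshly deserialized instance
        // never dereferences a null comparator.
        return get().compare(a, b);
    }
}
```

The NPE described above corresponds to a code path that reads the field directly instead of going through the lazy accessor after Kryo deserialization.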
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12319#issuecomment-214132553 Thanks @liancheng , @rxin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14387][SQL] Exceptions thrown when quer...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12293#issuecomment-213275126 \cc @liancheng --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14521][SQL] StackOverflowError in Kryo ...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12514#issuecomment-213215983 Sure @yzhou2001, please go ahead. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14387][SQL] Exceptions thrown when quer...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12293#issuecomment-213207842 Changes:
- Rebased patch to master branch
- Removed OrcTableScan as it is not used anywhere.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12319#issuecomment-212766510 Thanks for the review @liancheng Latest commit addresses the review comments. Changes are as follows:
- Moved OrcRecordReader changes to SparkOrcNewRecordReader in spark-hive
- Removed pom.xml related changes
- Fixed styling issues.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12319#issuecomment-212393980 Thanks for the review @liancheng . Should I create a separate PR for OrcRecordReader in https://github.com/pwendell/hive/tree/release-1.2.1-spark referencing this ticket? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14521][SQL] StackOverflowError in Kryo ...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12514#issuecomment-212392651 Sure, will check on removing the circular reference. Took the reference tracking approach, as it is enabled by default with Spark's KryoSerializer. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14521][SQL] StackOverflowError in Kryo ...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12514#issuecomment-212191093 \cc @JoshRosen --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14521][SQL] StackOverflowError in Kryo ...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/12514 [SPARK-14521][SQL] StackOverflowError in Kryo when executing TPC-DS Query27 ## What changes were proposed in this pull request? Observed a StackOverflowError in Kryo when executing TPC-DS Query27. The Spark thrift server disables Kryo reference tracking (if not specified in conf); when "spark.kryo.referenceTracking" is set to true explicitly in spark-defaults.conf, the query executes successfully. Recent changes in HashedRelation could have introduced cycles, which would require "spark.kryo.referenceTracking=true" in the spark-thrift server. This PR addresses this by setting referenceTracking to true in SparkSQLEnv. ## How was this patch tested? Manually running TPC-DS queries at 200 GB scale in a multi-node cluster. Also ran org.apache.spark.sql.hive.execution.HiveCompatibilitySuite, org.apache.spark.sql.hive.execution.HiveQuerySuite, org.apache.spark.sql.hive.execution.PruningSuite, org.apache.spark.sql.hive.CachedTableSuite You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-14521 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12514.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12514 commit 59875f424aaf60aa90ca5a1006df8b5e20d4f83a Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-04-20T00:59:32Z [SPARK-14521][SQL] StackOverflowError in Kryo when executing TPC-DS Query27
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
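Why reference tracking matters for cyclic graphs can be shown without Kryo at all: any recursive walk of an object graph must remember which objects it has already visited (by identity), or a cycle recurses forever and overflows the stack. A self-contained illustration (names are mine, not Kryo's API):

```java
import java.util.IdentityHashMap;

public class ReferenceTracking {
    static class Node { Node next; }

    // Counts nodes the way a reference-tracking serializer walks a graph:
    // each object is visited at most once, so a cycle cannot trigger
    // unbounded recursion (the analogue of the StackOverflowError above).
    static int countWithTracking(Node n, IdentityHashMap<Node, Boolean> seen) {
        if (n == null || seen.containsKey(n)) return 0;
        seen.put(n, true);
        return 1 + countWithTracking(n.next, seen);
    }

    public static void main(String[] args) {
        Node a = new Node(), b = new Node();
        a.next = b;
        b.next = a; // a cycle, like a back-reference inside HashedRelation
        System.out.println(countWithTracking(a, new IdentityHashMap<>())); // prints 2
    }
}
```

Dropping the `seen` map turns this into the untracked case: the recursion on the `a -> b -> a` cycle never terminates, which is what disabling "spark.kryo.referenceTracking" risks on cyclic data.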
[GitHub] spark pull request: SPARK-14321. [SQL] Reduce date format cost and...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12105#issuecomment-211286426 In the generated code, it returns null if constFormat == null. So it is not required to change the generated code. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-14113. Consider marking JobConf closure-...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/11978#issuecomment-211175800 @srowen - As per Andrew's comment, I thought it was fine to make the change given that HadoopRDD is marked as DeveloperAPI. Please let me know if any additional changes are needed. Additional info: SPARK-13664 brought a huge amount of change for FileSourceStrategy, which is now the default code path, so OrcRelation would no longer go via the HadoopRDD code path by default. Given that, this PR would have an impact only if someone invokes HadoopRDD directly and has done closure clearing upfront. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-14321. [SQL] Reduce date format cost and...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/12105#discussion_r60001514 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -368,7 +369,10 @@ abstract class UnixTime extends BinaryExpression with ExpectsInputTypes { t.asInstanceOf[Long] / 100L case StringType if right.foldable => if (constFormat != null) { -Try(new SimpleDateFormat(constFormat.toString).parse( +if (formatter == null) { + formatter = Try(new SimpleDateFormat(constFormat.toString)).getOrElse(null) --- End diff -- Didn't want to throw the error back, as it would break the earlier functionality: previously, it returned null when any exception (e.g. constFormat being null, or a parsing error) was thrown. The recent commit creates the formatter upfront and handles null early, to keep the changes minimal. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-14321. [SQL] Reduce date format cost and...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12105#issuecomment-210384204 Revised the patch addressing the comments. Fixed eval() of UnixTime and FromUnixTime. Haven't changed eval in DateFormatClass, as I am not sure whether the format can change in between. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-14321. [SQL] Reduce date format cost and...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12105#issuecomment-210202377 Sorry about the delay. I will share the updated patch today. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/12319#discussion_r59328799 --- Diff: sql/core/src/main/java/org/apache/hadoop/hive/ql/io/orc/OrcRecordReader.java --- @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hadoop.hive.ql.io.orc; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector; +import org.apache.hadoop.io.NullWritable; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +import java.io.IOException; +import java.util.List; + +public class OrcRecordReader extends RecordReader<NullWritable, OrcStruct> { --- End diff -- Sure. This is based on OrcNewInputFormat.OrcRecordReader (which is marked private). The only addition is getObjectInspector, targeted at reducing NameNode calls later. I will update the doc. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12319#issuecomment-208694717 Sure @rxin, makes sense. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-14551. [SQL] Reduce number of NN calls i...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/12319

SPARK-14551. [SQL] Reduce number of NN calls in OrcRelation with File…

## What changes were proposed in this pull request?

When FileSourceStrategy is used, a record reader is created, which incurs a NameNode call internally. Later, in OrcRelation.unwrapOrcStructs, it ends up reading the file information again to get the ObjectInspector, which incurs an additional NameNode call. It would be good to avoid this additional call, specifically for partitioned datasets.

Added OrcRecordReader, which is very similar to OrcNewInputFormat.OrcRecordReader but with an option of exposing the ObjectInspector. This eliminates the need to look up the file later for generating the object inspector, which would be specifically useful for partitioned tables/datasets.

## How was this patch tested?

Ran TPC-DS queries manually and also verified by running org.apache.spark.sql.hive.orc.OrcSuite, org.apache.spark.sql.hive.orc.OrcQuerySuite, org.apache.spark.sql.hive.orc.OrcPartitionDiscoverySuite, OrcHadoopFsRelationSuite, org.apache.spark.sql.hive.execution.HiveCompatibilitySuite

…SourceStrategy mode

You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-14551 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12319.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12319

commit 1b99d95e3361ed526a93abc6a3e1c93e6de578de
Author: Rajesh Balamohan <rbalamo...@apache.org>
Date: 2016-04-12T02:30:26Z

SPARK-14551. [SQL] Reduce number of NN calls in OrcRelation with FileSourceStrategy mode
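The caching idea in this PR can be sketched generically. The sketch below is an illustrative stand-in, with hypothetical names (it is not the actual OrcRecordReader code): derive the expensive metadata once when the reader is opened and expose it via a getter, so downstream code such as unwrapOrcStructs does not have to re-open the file, which is what costs the extra NameNode call.

```java
// Illustrative stand-in for the PR's idea (names are hypothetical, not the
// actual OrcRecordReader): read the expensive metadata once at open time and
// expose it, instead of re-opening the file later to fetch it.
class CachingReader {
    // Stand-in counter for file opens (each one costs a NameNode call).
    static int openCount = 0;

    private final String schema; // stands in for the ORC ObjectInspector

    private CachingReader(String schema) {
        this.schema = schema;
    }

    static CachingReader open(String path) {
        openCount++; // one "file open" per reader creation
        // Pretend we read the schema from the ORC footer while the file is open.
        return new CachingReader("struct<...>");
    }

    // Downstream code asks the reader rather than re-opening the file.
    String getObjectInspector() {
        return schema;
    }
}
```

With the metadata cached at open time, asking for the inspector any number of times still results in a single file open.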
[GitHub] spark pull request: SPARK-14387. [SQL] Exceptions thrown when quer...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/12293

SPARK-14387. [SQL] Exceptions thrown when querying ORC tables stored …

## What changes were proposed in this pull request?

Physical files stored in Hive as ORC have internal column names like _col1, _col2, etc., and the column mapping is available in the HiveMetastore. It was possible to query ORC tables stored in Hive via Spark's beeline client in earlier branches, but with the master branch this was broken. When reading ORC files, it would be good to map the Hive schema to the physical schema to support backward compatibility. This PR addresses that issue.

## How was this patch tested?

Manual execution of TPC-DS queries at 200 GB scale.

…in Hive

You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-14387 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12293.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12293

commit 1bc4e98ff19e76a2302003268c7a2c374647aad3
Author: Rajesh Balamohan <rbalamo...@apache.org>
Date: 2016-04-11T05:53:24Z

SPARK-14387. [SQL] Exceptions thrown when querying ORC tables stored in Hive
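The column-mapping idea can be sketched as follows; this is a hypothetical illustration, not code from the patch. Hive-written ORC files carry internal names (_col0, _col1, ...), while the logical names come from the metastore schema, matched ordinal by ordinal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not the patch itself): translate ORC's internal
// physical column names (_col0, _col1, ...) back to the logical names the
// Hive metastore records for the table, matching by ordinal.
class OrcColumnMapper {
    static List<String> mapPhysicalToLogical(List<String> physicalNames,
                                             List<String> metastoreNames) {
        List<String> resolved = new ArrayList<>();
        for (String physical : physicalNames) {
            if (physical.startsWith("_col")) {
                // _colN maps to the N-th column of the metastore schema.
                int ordinal = Integer.parseInt(physical.substring("_col".length()));
                resolved.add(metastoreNames.get(ordinal));
            } else {
                // Files already written with real names need no translation.
                resolved.add(physical);
            }
        }
        return resolved;
    }
}
```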
[GitHub] spark pull request: SPARK-14321. [SQL] Reduce date format cost and...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12105#issuecomment-205122837 Agreed. Thanks @srowen . Reverted calendar changes in DateTimeUtils in recent commit.
[GitHub] spark pull request: SPARK-14321. [SQL] Reduce date format cost and...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/12105#issuecomment-205082821 The SimpleDateFormat declared in the generated code is not shared across multiple threads.
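For context, SimpleDateFormat is not thread-safe, which is why sharing matters here: a per-object instance is only safe because the generated code never hands it to multiple threads. If an instance ever did need to be shared, one common pattern (not what this patch does) is a ThreadLocal holding a per-thread formatter:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch for context (not from the patch): SimpleDateFormat is not
// thread-safe, so if it had to be shared across threads, a ThreadLocal
// would give each thread its own instance.
class SafeFormatter {
    private static final ThreadLocal<SimpleDateFormat> FMT =
        ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"));

    static String format(long epochSeconds) {
        return FMT.get().format(new Date(epochSeconds * 1000L));
    }
}
```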
[GitHub] spark pull request: SPARK-14113. Consider marking JobConf closure-...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/11978#issuecomment-204307805

@andrewor14 - Not sure if I understood your last comment. Currently no direct invocation of HadoopRDD (with initLocalJobConfFuncOpt) is made in Spark. At a later point in time, if a change is needed to invoke HadoopRDD (with initLocalJobConfFuncOpt) via SparkContext, the following method could be added, which cleans up the function.

```
def hadoopRDD[K, V](
    broadcastedConf: Broadcast[SerializableConfiguration],
    initLocalJobConfFuncOpt: Option[JobConf => Unit],
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  assertNotStopped()
  clean(initLocalJobConfFuncOpt)
  new HadoopRDD(this, broadcastedConf, initLocalJobConfFuncOpt, inputFormatClass,
    keyClass, valueClass, minPartitions)
}
```

But I am not sure whether we need to clean sc.hadoopRDD in this patch. Please let me know.
[GitHub] spark pull request: SPARK-14321. [SQL] Reduce date format cost and...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/12105

SPARK-14321. [SQL] Reduce date format cost and string-to-date cost i…

## What changes were proposed in this pull request?

Here is the generated code snippet when executing date functions. SimpleDateFormat is fairly expensive and can show up as a bottleneck when processing millions of records. It would be better to instantiate it once.

```
/* 066 */ UTF8String primitive5 = null;
/* 067 */ if (!isNull4) {
/* 068 */   try {
/* 069 */     primitive5 = UTF8String.fromString(new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(
/* 070 */       new java.util.Date(primitive7 * 1000L)));
/* 071 */   } catch (java.lang.Throwable e) {
/* 072 */     isNull4 = true;
/* 073 */   }
/* 074 */ }
```

With the modified code, here is the generated code:

```
/* 010 */ private java.text.SimpleDateFormat sdf2;
/* 011 */ private UnsafeRow result13;
/* 012 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder bufferHolder14;
/* 013 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter15;
/* 014 */ ...
...
/* 065 */ boolean isNull0 = isNull3;
/* 066 */ UTF8String primitive1 = null;
/* 067 */ if (!isNull0) {
/* 068 */   try {
/* 069 */     if (sdf2 == null) {
/* 070 */       sdf2 = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
/* 071 */     }
/* 072 */     primitive1 = UTF8String.fromString(sdf2.format(
/* 073 */       new java.util.Date(primitive4 * 1000L)));
/* 074 */   } catch (java.lang.Throwable e) {
/* 075 */     isNull0 = true;
/* 076 */   }
/* 077 */ }
```

Similarly, Calendar.getInstance was used in DateTimeUtils, which can be lazily initialized.

## How was this patch tested?

org.apache.spark.sql.catalyst.expressions.DateExpressionsSuite, org.apache.spark.sql.catalyst.util.DateTimeUtilsSuite

Also tried with a couple of sample SQL queries with a single executor (6GB), which showed good improvement with the fix.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-14321 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12105.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12105

commit 6fd07db11b5c9eed795dde11177f1c245a6fef16
Author: Rajesh Balamohan <rbalamo...@apache.org>
Date: 2016-04-01T02:41:07Z

SPARK-14321. [SQL] Reduce date format cost and string-to-date cost in date functions
[GitHub] spark pull request: SPARK-14113. Consider marking JobConf closure-...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/11978#issuecomment-204200260 Thanks @andrewor14 . Addressed your review comments in latest commit.
[GitHub] spark pull request: SPARK-14113. Consider marking JobConf closure-...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/11978#discussion_r57537799

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -979,6 +979,7 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
 // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
 val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
 val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
+clean(setInputPathsFunc)
--- End diff --

Thanks @srowen. Yes, for invocations via sc.textFile. Adding an additional method like the following and passing initLocalJobConfFuncOpt to it can help avoid closure cleaning in this scenario. However, this would call for changes in all other places where sc.textFile is invoked. The intention was to allow users to make use of HadoopRDD directly (if needed) without having to incur the cost of closure cleaning (e.g. in sql modules). Hence did not make those additional changes.

```
def newTextFile(
    path: String,
    initLocalJobConfFuncOpt: Option[JobConf => Unit],
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], initLocalJobConfFuncOpt,
    classOf[LongWritable], classOf[Text], minPartitions)
    .map(pair => pair._2.toString).setName(path)
}

def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    initLocalJobConfFuncOpt: Option[JobConf => Unit],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  assertNotStopped()
  // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
  val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
  new HadoopRDD(
    this, confBroadcast, initLocalJobConfFuncOpt, inputFormatClass,
    keyClass, valueClass, minPartitions).setName(path)
}

// e.g. sc.newTextFile(tmpFilePath, Some(setInputPathsFunc), 4).count()
```
[GitHub] spark pull request: SPARK-14113. Consider marking JobConf closure-...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/11978#issuecomment-202048734 Tested with the following suites, along with the earlier SQL suites: org.apache.spark.FileSuite org.apache.spark.SparkContextSuite org.apache.spark.graphx.GraphLoaderSuite org.apache.spark.graphx.lib.SVDPlusPlusSuite org.apache.spark.metrics.InputOutputMetricsSuite org.apache.spark.ml.PipelineSuite org.apache.spark.ml.classification.DecisionTreeClassifierSuite org.apache.spark.ml.classification.LogisticRegressionSuite org.apache.spark.ml.classification.MultilayerPerceptronClassifierSuite org.apache.spark.ml.classification.NaiveBayesSuite org.apache.spark.ml.clustering.KMeansSuite org.apache.spark.ml.clustering.LDASuite org.apache.spark.ml.evaluation.BinaryClassificationEvaluatorSuite org.apache.spark.ml.evaluation.MulticlassClassificationEvaluatorSuite org.apache.spark.ml.evaluation.RegressionEvaluatorSuite org.apache.spark.ml.feature.BinarizerSuite org.apache.spark.ml.feature.BucketizerSuite org.apache.spark.ml.feature.ChiSqSelectorSuite org.apache.spark.ml.feature.CountVectorizerSuite org.apache.spark.ml.feature.DCTSuite org.apache.spark.ml.feature.ElementwiseProductSuite org.apache.spark.ml.feature.HashingTFSuite org.apache.spark.ml.feature.IDFSuite org.apache.spark.ml.feature.InteractionSuite org.apache.spark.ml.feature.MaxAbsScalerSuite org.apache.spark.ml.feature.MinMaxScalerSuite org.apache.spark.ml.feature.NGramSuite org.apache.spark.ml.feature.NormalizerSuite org.apache.spark.ml.feature.OneHotEncoderSuite org.apache.spark.ml.feature.PCASuite org.apache.spark.ml.feature.PolynomialExpansionSuite org.apache.spark.ml.feature.QuantileDiscretizerSuite org.apache.spark.ml.feature.RFormulaSuite org.apache.spark.ml.feature.RegexTokenizerSuite org.apache.spark.ml.feature.SQLTransformerSuite org.apache.spark.ml.feature.StandardScalerSuite org.apache.spark.ml.feature.StopWordsRemoverSuite org.apache.spark.ml.feature.StringIndexerSuite org.apache.spark.ml.feature.TokenizerSuite org.apache.spark.ml.feature.VectorAssemblerSuite org.apache.spark.ml.feature.VectorIndexerSuite org.apache.spark.ml.feature.VectorSlicerSuite org.apache.spark.ml.feature.Word2VecSuite org.apache.spark.ml.recommendation.ALSSuite org.apache.spark.ml.regression.AFTSurvivalRegressionSuite org.apache.spark.ml.regression.DecisionTreeRegressorSuite org.apache.spark.ml.regression.GeneralizedLinearRegressionSuite org.apache.spark.ml.regression.IsotonicRegressionSuite org.apache.spark.ml.regression.LinearRegressionSuite org.apache.spark.ml.source.libsvm.LibSVMRelationSuite org.apache.spark.ml.tuning.CrossValidatorSuite org.apache.spark.ml.util.DefaultReadWriteSuite org.apache.spark.mllib.classification.LogisticRegressionSuite org.apache.spark.mllib.classification.NaiveBayesSuite org.apache.spark.mllib.classification.SVMSuite org.apache.spark.mllib.clustering.GaussianMixtureSuite org.apache.spark.mllib.clustering.KMeansSuite org.apache.spark.mllib.clustering.LDASuite org.apache.spark.mllib.clustering.PowerIterationClusteringSuite org.apache.spark.mllib.feature.ChiSqSelectorSuite org.apache.spark.mllib.feature.Word2VecSuite org.apache.spark.mllib.fpm.FPGrowthSuite org.apache.spark.mllib.recommendation.MatrixFactorizationModelSuite org.apache.spark.mllib.regression.IsotonicRegressionSuite org.apache.spark.mllib.regression.LassoSuite org.apache.spark.mllib.regression.LinearRegressionSuite org.apache.spark.mllib.regression.RidgeRegressionSuite org.apache.spark.mllib.tree.DecisionTreeSuite org.apache.spark.mllib.tree.GradientBoostedTreesSuite org.apache.spark.mllib.tree.RandomForestSuite org.apache.spark.mllib.util.MLUtilsSuite org.apache.spark.rdd.HadoopRDD, org.apache.spark.rdd.MapPartitionsRDD, org.apache.spark.rdd.PairRDDFunctionsSuite org.apache.spark.repl.ReplSuite org.apache.spark.sql.execution.datasources.csv.CSVSuite org.apache.spark.sql.execution.datasources.json.JsonSuite
[GitHub] spark pull request: SPARK-14113. Consider marking JobConf closure-...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/11978

SPARK-14113. Consider marking JobConf closure-cleaning in HadoopRDD a…

## What changes were proposed in this pull request?

In HadoopRDD, the following code was introduced as a part of SPARK-6943.

```
if (initLocalJobConfFuncOpt.isDefined) {
  sparkContext.clean(initLocalJobConfFuncOpt.get)
}
```

Passing initLocalJobConfFuncOpt to HadoopRDD incurs a significant performance penalty (due to closure cleaning) with a large number of RDDs. This would be invoked for every HadoopRDD initialization, causing the bottleneck. An example thread stack is given below:

```
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.readUTF8(Unknown Source)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:402)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:102)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:390)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$15.apply(ClosureCleaner.scala:224)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$15.apply(ClosureCleaner.scala:223)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:223)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2079)
at org.apache.spark.rdd.HadoopRDD.<init>(HadoopRDD.scala:112)
```

This PR does the following:
1. Removes the closure cleaning in HadoopRDD init, which was mainly added to check whether HadoopRDD can be made serializable or not.
2. Directly instantiates HadoopRDD in OrcRelation, instead of going via SparkContext.hadoopRDD (which internally invokes a thread dump in "withScope"). Clubbing this change here instead of making a separate ticket for this minor change.

## How was this patch tested?

No new tests have been added. Used the following code to measure the overhead of the HadoopRDD init codepath. With the patch it took 340ms, as opposed to 4815ms without the patch. Also tested with a number of queries from TPC-DS in a multi-node environment.

Along with that, ran the following unit tests: org.apache.spark.sql.hive.execution.HiveCompatibilitySuite, org.apache.spark.sql.hive.execution.HiveQuerySuite, org.apache.spark.sql.hive.execution.PruningSuite, org.apache.spark.sql.hive.CachedTableSuite, org.apache.spark.rdd.RDDOperationScopeSuite, org.apache.spark.ui.jobs.JobProgressListenerSuite

```
test("Check timing for HadoopRDD init") {
  val start: Long = System.currentTimeMillis();
  val initializeJobConfFunc = HadoopTableReader.initializeLocalJobConfFunc("", null) _
  Utils.withDummyCallSite(sqlContext.sparkContext) {
    // Large tables end up creating 5500 RDDs
    for (i <- 1 to 5500) {
      // ignore nulls in RDD as its mainly for testing timing of RDD creation
      val testRDD = new HadoopRDD(sqlContext.sparkContext, null, Some(initializeJobConfFunc),
        null, classOf[NullWritable], classOf[Writable], 10)
    }
  }
  val end: Long = System.currentTimeMillis();
  println("Time taken : " + (end - start))
}
```

Without Patch: (time taken to init 5000 HadoopRDD)
[GitHub] spark pull request: SPARK-14091 [core] Consider improving performa...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/11911#issuecomment-200552367 Thanks @JoshRosen and @srowen . Retested with "lazy val" which has the same perf improvement. Added "lazy val" in latest commit.
[GitHub] spark pull request: SPARK-14091 [core] Consider improving performa...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/11911#discussion_r57150734

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1745,11 +1745,16 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  * has overridden the call site using `setCallSite()`, this will return the user's version.
  */
 private[spark] def getCallSite(): CallSite = {
-  val callSite = Utils.getCallSite()
-  CallSite(
-    Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
-    Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
-  )
+  var (shortForm, longForm) = (getLocalProperty(CallSite.SHORT_FORM),
+    getLocalProperty(CallSite.LONG_FORM))
+
+  if (shortForm == null || longForm == null) {
+    val callSite = Utils.getCallSite()
+    shortForm = callSite.shortForm
--- End diff --

Thanks @srowen. In Utils.withDummyCallSite(), both LONG_FORM and SHORT_FORM are explicitly set to "". But I can see that it is possible to explicitly set one of them via setCallSite(shortCallSite). Incorporated your review comments in the latest commit.
[GitHub] spark pull request: SPARK-14091 [core] Consider improving performa...
Github user rajeshbalamohan commented on a diff in the pull request: https://github.com/apache/spark/pull/11911#discussion_r57148521

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1745,10 +1745,11 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  * has overridden the call site using `setCallSite()`, this will return the user's version.
  */
 private[spark] def getCallSite(): CallSite = {
-  val callSite = Utils.getCallSite()
   CallSite(
-    Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
-    Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
+    Option(getLocalProperty(CallSite.SHORT_FORM))
--- End diff --

Thanks @srowen . Incorporated the review comments in the latest commit. Please review.
[GitHub] spark pull request: SPARK-14091 [core] Consider improving performa...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/11911

SPARK-14091 [core] Consider improving performance of SparkContext.get…

## What changes were proposed in this pull request?

Currently SparkContext.getCallSite() makes a call to Utils.getCallSite().

```
private[spark] def getCallSite(): CallSite = {
  val callSite = Utils.getCallSite()
  CallSite(
    Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
    Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
  )
}
```

However, in some places Utils.withDummyCallSite(sc) is invoked to avoid expensive thread dumps within getCallSite(). But Utils.getCallSite() is evaluated eagerly, causing the thread dumps to be computed anyway. This has an impact when lots of RDDs are created (e.g. it spends close to 3-7 seconds when 1000+ RDDs are present, which can be significant when the entire query runtime is on the order of 10-20 seconds). Creating this JIRA to consider evaluating getCallSite only when needed.

## How was this patch tested?

No new test cases are added. The following standalone test was tried out manually. Also built the entire Spark binary and tried with a few SQL queries from TPC-DS and TPC-H in a multi-node cluster.

```
def run(): Unit = {
  val conf = new SparkConf()
  val sc = new SparkContext("local[1]", "test-context", conf)
  val start: Long = System.currentTimeMillis();
  val confBroadcast = sc.broadcast(new SerializableConfiguration(new Configuration()))
  Utils.withDummyCallSite(sc) {
    // Large tables end up creating 5500 RDDs
    for (i <- 1 to 5000) {
      val testRDD = new HadoopRDD(sc, confBroadcast, None, null,
        classOf[NullWritable], classOf[Writable], 10)
    }
  }
  val end: Long = System.currentTimeMillis();
  println("Time taken : " + (end - start))
}

def main(args: Array[String]): Unit = {
  run
}
```

…CallSite() (rbalamohan)

You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-14091 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11911.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11911

commit dba630b854d6fdb298f8ef7ed25acf497f0eeebe
Author: Rajesh Balamohan <rbalamo...@apache.org>
Date: 2016-03-23T04:57:01Z

SPARK-14091 [core] Consider improving performance of SparkContext.getCallSite() (rbalamohan)
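The fix amounts to deferring the expensive Utils.getCallSite() (a thread dump) until it is actually needed. Scala's `lazy val` provides this directly; the hypothetical Java wrapper below sketches the same deferred, memoized evaluation (it is an illustration, not Spark code):

```java
import java.util.function.Supplier;

// Sketch of the deferred-evaluation idea behind the fix: memoize an
// expensive computation so it runs at most once, and only if requested.
class Lazy<T> {
    private Supplier<T> supplier;
    private T value;
    private boolean computed = false;

    Lazy(Supplier<T> supplier) {
        this.supplier = supplier;
    }

    synchronized T get() {
        if (!computed) {
            value = supplier.get();
            computed = true;
            supplier = null; // allow the closure to be garbage-collected
        }
        return value;
    }
}
```

When the local SHORT_FORM/LONG_FORM properties are already set (as withDummyCallSite does), the deferred computation is simply never triggered.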
[GitHub] spark pull request: SPARK-12925. Improve HiveInspectors.unwrap for...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/11477#issuecomment-192146953 Thanks @srowen . Incorporated the changes. This was tested with HiveCompatibilitySuite and HiveQuerySuite. These tests ran fine on the master branch without the changes as well. However, when tried with the 1.6 branch, these test suites failed with the copy issues. Hence doing an explicit byte copy in master, so that this does not fail in future.
[GitHub] spark pull request: SPARK-12925. Improve HiveInspectors.unwrap for...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/11477

SPARK-12925. Improve HiveInspectors.unwrap for StringObjectInspector.…

The earlier fix did not copy the bytes, and it is possible for a higher level to reuse the Text object, which was causing issues. The proposed fix now copies the bytes from Text. This still avoids the expensive encoding/decoding.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-12925.2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11477.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11477

commit d46d41ea75ecfeaef208f6c54222f23c24ebd3b0
Author: Rajesh Balamohan <rbalamo...@apache.org>
Date: 2016-03-02T23:39:21Z

SPARK-12925. Improve HiveInspectors.unwrap for StringObjectInspector.getPrimitiveWritableObject. Earlier fix did not copy the bytes and it is possible for higher level to reuse Text object. This was causing issues. Proposed fix now copies the bytes from Text. This still avoids the expensive encoding/decoding
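The hazard being fixed is object reuse: Hadoop record readers reuse the same Text instance across records, so keeping a reference to its backing bytes means later reads overwrite earlier values. The stand-in below uses plain arrays instead of Hadoop's Text (to stay self-contained); the class and method names are hypothetical, but the aliasing-versus-copying contrast is the point of the fix:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Minimal stand-in for the reuse hazard the fix addresses: a record reader
// reuses one buffer across records, so callers must copy the bytes they keep.
class ReuseDemo {
    // Simulated reusable buffer, like the single Text a record reader holds.
    static byte[] buffer = new byte[16];
    static int length = 0;

    static void readNext(String value) {
        byte[] b = value.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(b, 0, buffer, 0, b.length);
        length = b.length;
    }

    // Unsafe: aliases the shared buffer, so a later read clobbers the value.
    static byte[] unsafeCapture() {
        return buffer;
    }

    // Safe: copies exactly `length` bytes, as the fix does for Text.
    static byte[] safeCapture() {
        return Arrays.copyOf(buffer, length);
    }
}
```

Copying only the valid `length` bytes keeps the cheap raw-bytes path (no string encoding/decoding) while making the captured value immune to buffer reuse.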
[GitHub] spark pull request: SPARK-12417. [SQL] Orc bloom filter options ar...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/10842#issuecomment-185519589 Closed #10375 which was the dup of this pull request. Review can be done on this pull request. Thanks @JoshRosen
[GitHub] spark pull request: SPARK-12417. [SQL] Orc bloom filter options ar...
Github user rajeshbalamohan closed the pull request at: https://github.com/apache/spark/pull/10375
[GitHub] spark pull request: SPARK-12417. [SQL] Orc bloom filter options ar...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/10375#issuecomment-185519219 Closing this as dup of #10842
[GitHub] spark pull request: SPARK-12417. [SQL] Orc bloom filter options ar...
GitHub user rajeshbalamohan reopened a pull request: https://github.com/apache/spark/pull/10842 SPARK-12417. [SQL] Orc bloom filter options are not propagated during… Add option to have bloom filters in ORC write codepath. Added changes to apply cleanly in master branch. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-12417 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10842.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10842 commit 8615764056bc9039933ca97d85564cf60097fb5a Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-01-19T04:16:28Z SPARK-12417. [SQL] Orc bloom filter options are not propagated during file
[GitHub] spark pull request: SPARK-12417. [SQL] Orc bloom filter options ar...
Github user rajeshbalamohan closed the pull request at: https://github.com/apache/spark/pull/10842
[GitHub] spark pull request: SPARK-12920. [SQL]. Spark thrift server can ru...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/10846#issuecomment-184185874 @JoshRosen - Could you please let me know how to proceed with this patch? It reduces the CPU utilization of the spark-thrift server in a multi-user environment.
[GitHub] spark pull request: SPARK-12948. [SQL]. Consider reducing size of ...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/10861#issuecomment-175584028 @JoshRosen - Please let me know if my latest comment on the usecase addresses your question. >> may be worth a holistic design review because I think there are some hacks in Spark SQL to address this problem there and it would be nice to have a unified solution for this Can you please provide more details/pointers on this?
[GitHub] spark pull request: SPARK-12998 [SQL]. Enable OrcRelation when con...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/10938 SPARK-12998 [SQL]. Enable OrcRelation when connecting via spark thrif… When a user connects via the spark-thrift server to execute SQL, PPD is not enabled with ORC: a MetastoreRelation, which does not support ORC PPD, ends up being created. The purpose of this JIRA is to convert MetastoreRelation to OrcRelation in HiveMetastoreCatalog, so that users can benefit from PPD even when connecting to the spark-thrift server. For example, the query "select count(1) from tpch_flat_orc_1000.lineitem where l_shipdate = '1990-04-18'", fired against the spark-thrift server or sqlContext, would end up using OrcRelation to make use of PPD instead of MetastoreRelation. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark spark-12998 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10938.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10938 commit 1a5b164153df946d713b34727a001d5005479d1a Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-01-27T02:19:53Z SPARK-12998 [SQL]. Enable OrcRelation when connecting via spark thrift server.
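The conversion described above can be sketched as a routing decision on the table's storage format. This uses hypothetical types, not Spark's actual classes, purely to illustrate why the catalog-level conversion matters for predicate pushdown:

```java
// Sketch (hypothetical types): route ORC-backed tables to a relation that
// supports predicate pushdown (PPD), instead of the generic metastore
// relation that does not.
public class CatalogDemo {
    enum Relation {
        METASTORE_RELATION(false),  // generic path: no ORC PPD
        ORC_RELATION(true);         // ORC-aware path: PPD available

        final boolean supportsPpd;
        Relation(boolean supportsPpd) { this.supportsPpd = supportsPpd; }
    }

    // Mirrors the idea of HiveMetastoreCatalog converting MetastoreRelation
    // to OrcRelation when the table's storage format is ORC.
    static Relation resolve(String storageFormat) {
        return "orc".equalsIgnoreCase(storageFormat)
                ? Relation.ORC_RELATION
                : Relation.METASTORE_RELATION;
    }

    public static void main(String[] args) {
        System.out.println(resolve("ORC"));   // ORC_RELATION
        System.out.println(resolve("text"));  // METASTORE_RELATION
    }
}
```

With the PPD-capable relation selected, a filter like `l_shipdate = '1990-04-18'` can be evaluated against ORC stripe/row-group statistics instead of after a full scan.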
[GitHub] spark pull request: SPARK-12948. [SQL]. Consider reducing size of ...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/10861#issuecomment-174355176 **Usecase**: The user maps a partitioned dataset (e.g., the TPC-DS dataset at 200 GB scale) and runs a query in spark-shell. E.g ... val o_store_sales = sqlContext.read.format("orc").load("/tmp/spark_tpcds_bin_partitioned_orc_200/store_sales") o_store_sales.registerTempTable("o_store_sales") .. sqlContext.sql("SELECT..").show(); ... When this is executed, OrcRelation creates Config objects for every partition (Ref: [OrcRelation.execute()](https://github.com/apache/spark/blob/e14817b528ccab4b4685b45a95e2325630b5fc53/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala#L295)). In the case of TPC-DS, it generates 1826 partitions. This info is broadcasted in [DAGScheduler#submitMissingTasks()](https://github.com/apache/spark/blob/1b2c2162af4d5d2d950af94571e69273b49bf913/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1010). As part of this, the configurations created for the 1826 partitions are also streamed through (i.e., embedded in HadoopMapPartitionsWithSplitRDD --> f() --> wrappedConf). Each of these configurations takes around 251 KB per partition. Please refer to the profiler snapshot attached in the JIRA ([mem_snap_shot](https://issues.apache.org/jira/secure/attachment/12784080/SPARK-12948.mem.prof.snapshot.png)). This causes quite a bit of delay in the overall job runtime. The patch reuses the already broadcasted conf from SparkContext. The fillObject() function is executed later for every partition, which internally sets up any additional config details. This drastically reduces the amount of payload that is broadcasted and helps in reducing the overall job runtime.
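The size argument above can be illustrated with plain Java serialization: shipping one shared conf plus small per-partition descriptors is far smaller than embedding a full conf copy per partition. This is a sketch with made-up entry counts, not Spark's actual payloads:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;

// Sketch: one shared conf vs. a duplicated conf per partition.
public class BroadcastSizeDemo {
    static int serializedSize(Serializable o) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(o);
            }
            return bos.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fresh strings each call, so serialization cannot share references across copies.
    static HashMap<String, String> newConf(int entries) {
        HashMap<String, String> conf = new HashMap<>();
        for (int i = 0; i < entries; i++) {
            conf.put("some.hadoop.key." + i, "some.value." + i);
        }
        return conf;
    }

    public static void main(String[] args) {
        int partitions = 200, entries = 200;
        // Per-partition copies: each task payload drags its own conf along.
        ArrayList<HashMap<String, String>> perPartition = new ArrayList<>();
        for (int p = 0; p < partitions; p++) perPartition.add(newConf(entries));
        // Shared: one conf plus tiny per-partition split descriptors.
        ArrayList<String> splits = new ArrayList<>();
        for (int p = 0; p < partitions; p++) splits.add("split-" + p);
        int duplicated = serializedSize(perPartition);
        int shared = serializedSize(newConf(entries)) + serializedSize(splits);
        System.out.println("duplicated=" + duplicated + " bytes, shared=" + shared + " bytes");
    }
}
```

At ~251 KB per partition and 1826 partitions, the duplicated form is on the order of hundreds of megabytes, which is the delay the profiler snapshot shows.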
[GitHub] spark pull request: SPARK-12920. [SQL]. Spark thrift server can ru...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/10846#issuecomment-173741417 Thanks @JoshRosen . The current patch is based on a flagging approach (in case of retaining caching) which would be safe for 1.6.x.
[GitHub] spark pull request: SPARK-12948. [SQL]. Consider reducing size of ...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/10861 SPARK-12948. [SQL]. Consider reducing size of broadcasts in OrcRelation The size of broadcasted data in OrcRelation was significantly higher when running queries with a large number of partitions (e.g., TPC-DS), and it has an impact on the job runtime. This is more evident when there is a large number of partitions/splits. A profiler snapshot is attached in SPARK-12948 (https://issues.apache.org/jira/secure/attachment/12783513/SPARK-12948_cpuProf.png). You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-12948 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10861.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10861 commit 4da7a22d2195c77e27aa4f3aa957b1fdc0d57f5b Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-01-21T05:09:09Z SPARK-12948. [SQL]. Consider reducing size of broadcasts in OrcRelation
[GitHub] spark pull request: SPARK-12920. [SQL]. Spark thrift server can ru...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/10846 SPARK-12920. [SQL]. Spark thrift server can run at very high CPU with… The Spark thrift server runs at very high CPU when concurrent users submit queries to the system over a period of time. Profiling revealed this is due to many Conf objects getting cached in HadoopRDD, which causes lots of GC pressure when running queries on datasets with a large number of partitions. Also, job UI retention causes issues with large jobs. Details are mentioned in SPARK-12920 and profiler snapshots are attached. The fix introduces "spark.hadoop.cacheConf" to optionally cache the jobConf in HadoopRDD. The JobProgressListener fixes are related to trimming the job/stage details. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-12920 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10846.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10846 commit 12d77beb0c6c4e3f336aab43a5907775da715753 Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-01-20T09:14:36Z SPARK-12920. [SQL]. Spark thrift server can run at very high CPU with concurrent users
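The flag-guarded caching idea can be sketched as follows. The class and method names here are assumptions for illustration (the real change lives in HadoopRDD, keyed off the proposed "spark.hadoop.cacheConf" property):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch (assumed names, not the actual patch): an opt-in cache so that
// conf objects are only retained when the deployment wants them retained.
public class ConfCache {
    private final boolean cacheEnabled;
    private final Map<String, Map<String, String>> cache = new HashMap<>();
    private int builds = 0;

    public ConfCache(boolean cacheEnabled) { this.cacheEnabled = cacheEnabled; }

    // Stand-in for an expensive JobConf construction.
    private Map<String, String> buildConf(String jobId) {
        builds++;
        Map<String, String> conf = new HashMap<>();
        conf.put("job.id", jobId);
        return conf;
    }

    public Map<String, String> getConf(String jobId) {
        if (cacheEnabled) {
            return cache.computeIfAbsent(jobId, this::buildConf);
        }
        // Caching off: rebuild each time; nothing is retained, so long-lived
        // servers with many concurrent users avoid the accumulated GC pressure.
        return buildConf(jobId);
    }

    public int buildCount() { return builds; }
}
```

The trade-off is construction cost per access versus heap retained across many queries; the flag lets the operator pick per deployment.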
[GitHub] spark pull request: SPARK-12925. [SQL]. Improve HiveInspectors.unw...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/10848 SPARK-12925. [SQL]. Improve HiveInspectors.unwrap for StringObjectIns… Text is in UTF-8, and converting it via "UTF8String.fromString" incurs decoding and encoding, which turns out to be expensive and redundant. Profiler snapshot details are attached in the JIRA (ref: https://issues.apache.org/jira/secure/attachment/12783331/SPARK-12925_profiler_cpu_samples.png) You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-12925 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10848.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10848 commit 30cc93246828da9728891f9ed6d65b26bcb888af Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-01-20T12:15:56Z SPARK-12925. [SQL]. Improve HiveInspectors.unwrap for StringObjectInspector.getPrimitiveWritableObject
[GitHub] spark pull request: SPARK-12898. Consider having dummyCallSite for...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/10825#issuecomment-173037438 getCallSite gets the thread stack trace (plus additional processing). This is executed numerous times when running a query on the TPC-DS dataset (with 1800+ partition files). I have attached the profiler info in SPARK-12898 (https://issues.apache.org/jira/secure/attachment/12783232/callsiteProf.png). Having dummyCallSite eliminates these stack-trace calls and reduces the overall job runtime.
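The cost difference is easy to see in isolation: capturing a stack trace walks and materializes every frame on each call, while a precomputed call site is a constant. A self-contained sketch (the `CallSite` class here is a simplified stand-in, not Spark's `org.apache.spark.util.CallSite`):

```java
// Sketch: a real stack walk per call vs. a constant dummy call site.
public class CallSiteDemo {
    static final class CallSite {
        final String shortForm;
        final String longForm;
        CallSite(String shortForm, String longForm) {
            this.shortForm = shortForm;
            this.longForm = longForm;
        }
    }

    // Roughly what getCallSite pays for: a full stack walk on every invocation.
    static CallSite realCallSite() {
        StackTraceElement[] stack = Thread.currentThread().getStackTrace();
        StringBuilder longForm = new StringBuilder();
        for (StackTraceElement e : stack) longForm.append(e).append('\n');
        String top = stack.length > 0 ? stack[0].toString() : "<unknown>";
        return new CallSite(top, longForm.toString());
    }

    // Computed once; reused for every partition scanned.
    static final CallSite DUMMY = new CallSite("HiveTableScan", "HiveTableScan");

    public static void main(String[] args) {
        int n = 10_000;
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) realCallSite();
        long walk = System.nanoTime() - t0;
        // The dummy path is a constant field read; its cost is negligible.
        System.out.println("stack walks for " + n + " calls took " + walk / 1_000_000 + " ms");
    }
}
```

With 1800+ partitions the walk happens per partition setup, which is why it shows up in the CPU profile.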
[GitHub] spark pull request: SPARK-12417. [SQL] Orc bloom filter options ar...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/10842 SPARK-12417. [SQL] Orc bloom filter options are not propagated during… Add option to have bloom filters in ORC write codepath. Added changes to apply cleanly in master branch. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-12417 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10842.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10842 commit 8615764056bc9039933ca97d85564cf60097fb5a Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-01-19T04:16:28Z SPARK-12417. [SQL] Orc bloom filter options are not propagated during file
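The shape of the fix is to carry the user-supplied bloom-filter options through to the writer configuration instead of dropping them. A sketch with real ORC property names but simplified plumbing (the `writerConf` helper is hypothetical, not Spark's actual code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: propagate bloom-filter options from the caller's options map
// into the writer configuration instead of silently dropping them.
public class OrcOptionDemo {
    static final List<String> BLOOM_KEYS = List.of(
            "orc.bloom.filter.columns",
            "orc.bloom.filter.fpp");

    static Map<String, String> writerConf(Map<String, String> base,
                                          Map<String, String> userOptions) {
        Map<String, String> conf = new HashMap<>(base);
        for (String key : BLOOM_KEYS) {
            if (userOptions.containsKey(key)) {
                conf.put(key, userOptions.get(key));  // carry the option through to the writer
            }
        }
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = writerConf(
                Map.of("orc.compress", "ZLIB"),
                Map.of("orc.bloom.filter.columns", "l_shipdate", "unrelated", "x"));
        System.out.println(conf);
    }
}
```

With the keys propagated, ORC writes bloom-filter streams for the named columns, which the FileDump utility mentioned below can then verify.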
[GitHub] spark pull request: SPARK-12898. Consider having dummyCallSite for...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/10825#issuecomment-173106715 Thanks for review. I have added a comment in the code for the same.
[GitHub] spark pull request: SPARK-12898. Consider having dummyCallSite for...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/10825 SPARK-12898. Consider having dummyCallSite for HiveTableScan Currently, HiveTableScan runs with getCallSite, which is really expensive and shows up when scanning through a large partitioned table (e.g., TPC-DS), slowing down the overall runtime of the job. It would be good to consider having dummyCallSite in HiveTableScan. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark SPARK-12898 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10825.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10825 commit 3a32561eb905b236014cad74472c3a8c359b1aa0 Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2016-01-19T04:27:52Z SPARK-12898. Consider having dummyCallSite for HiveTableScan
[GitHub] spark pull request: SPARK-12417. [SQL] Orc bloom filter options ar...
Github user rajeshbalamohan commented on the pull request: https://github.com/apache/spark/pull/10375#issuecomment-166486869 Thanks @zhzhan. Enabled ORC PPD by default and also added a test case for bloom filters in the latest commit. ORC RecordReaderImpl is not public in the version of Hive that is supported in Spark; hence relying on the FileDump utility from ORC to test bloom filters.
[GitHub] spark pull request: SPARK-12417. [SQL] Orc bloom filter options ar...
GitHub user rajeshbalamohan opened a pull request: https://github.com/apache/spark/pull/10375 SPARK-12417. [SQL] Orc bloom filter options are not propagated during file … You can merge this pull request into a Git repository by running: $ git pull https://github.com/rajeshbalamohan/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10375.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10375 commit 72436a94720bc73ff617a83337a321586c9a4de2 Author: Rajesh Balamohan <rbalamo...@apache.org> Date: 2015-12-18T10:08:05Z SPARK-12417. Orc bloom filter options are not propagated during file write in spark