[jira] [Updated] (SPARK-11105) Disitribute the log4j.properties files from the client to the executors

2015-10-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-11105:

Target Version/s:   (was: 1.5.1)

> Disitribute the log4j.properties files from the client to the executors
> ---
>
> Key: SPARK-11105
> URL: https://issues.apache.org/jira/browse/SPARK-11105
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Srinivasa Reddy Vundela
>Priority: Minor
>
> The log4j.properties file from the client is not distributed to the 
> executors. This means that the client settings are not applied to the 
> executors and they run with the default settings.
> This affects troubleshooting and data gathering.
> The workaround is to use the --files option for spark-submit to propagate the 
> log4j.properties file
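
For illustration only (paths and application names are hypothetical), the 
workaround amounts to shipping the file explicitly, e.g. 
{{spark-submit --files /path/to/log4j.properties --class com.example.MyApp myapp.jar}}, 
so that YARN places a copy in each executor container's working directory.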



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10577) [PySpark] DataFrame hint for broadcast join

2015-10-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957569#comment-14957569
 ] 

Reynold Xin commented on SPARK-10577:
-

I also backported this into branch-1.5 so this can be included in 1.5.2.


> [PySpark] DataFrame hint for broadcast join
> ---
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>Assignee: Jian Feng Zhang
>  Labels: starter
> Fix For: 1.5.2, 1.6.0
>
>
> As in https://issues.apache.org/jira/browse/SPARK-8300
> there should be a possibility to add a hint for broadcast join in:
> - PySpark
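
For reference, a minimal sketch of the existing Scala DataFrame hint from 
SPARK-8300 that this ticket asks to mirror in PySpark (the DataFrame names are 
illustrative):

{noformat}
import org.apache.spark.sql.functions.broadcast

// Hint the planner to broadcast the small side of the join.
val joined = largeDF.join(broadcast(smallDF), "id")
{noformat}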



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10577) [PySpark] DataFrame hint for broadcast join

2015-10-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10577:

Fix Version/s: 1.5.2

> [PySpark] DataFrame hint for broadcast join
> ---
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>Assignee: Jian Feng Zhang
>  Labels: starter
> Fix For: 1.5.2, 1.6.0
>
>
> As in https://issues.apache.org/jira/browse/SPARK-8300
> there should be a possibility to add a hint for broadcast join in:
> - PySpark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10845) SQL option "spark.sql.hive.version" doesn't show up in the result of "SET -v"

2015-10-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10845.
-
Resolution: Fixed

I backported it.


> SQL option "spark.sql.hive.version" doesn't show up in the result of "SET -v"
> -
>
> Key: SPARK-10845
> URL: https://issues.apache.org/jira/browse/SPARK-10845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: backport-needed
> Fix For: 1.5.2, 1.6.0
>
>
> When refactoring SQL options from plain strings to the strongly typed 
> {{SQLConfEntry}}, {{spark.sql.hive.version}} wasn't migrated, and doesn't 
> show up in the result of {{SET -v}}, as {{SET -v}} only shows public 
> {{SQLConfEntry}} instances.
> This affects compatibility with Simba ODBC driver.
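
As a quick check, one can list the options that {{SET -v}} exposes (a sketch 
only, assuming an existing {{sqlContext}}):

{noformat}
// Lists public SQLConfEntry options with their values and docs;
// once migrated, spark.sql.hive.version should appear here.
sqlContext.sql("SET -v").show(1000, false)
{noformat}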



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10845) SQL option "spark.sql.hive.version" doesn't show up in the result of "SET -v"

2015-10-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10845:

Fix Version/s: 1.5.2

> SQL option "spark.sql.hive.version" doesn't show up in the result of "SET -v"
> -
>
> Key: SPARK-10845
> URL: https://issues.apache.org/jira/browse/SPARK-10845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: backport-needed
> Fix For: 1.5.2, 1.6.0
>
>
> When refactoring SQL options from plain strings to the strongly typed 
> {{SQLConfEntry}}, {{spark.sql.hive.version}} wasn't migrated, and doesn't 
> show up in the result of {{SET -v}}, as {{SET -v}} only shows public 
> {{SQLConfEntry}} instances.
> This affects compatibility with Simba ODBC driver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6235) Address various 2G limits

2015-10-14 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957554#comment-14957554
 ] 

Glenn Strycker commented on SPARK-6235:
---

I don't think so, but I can check.  My RDD came from an RDD of type (K,V) that 
was partitioned by key and worked just fine... my new RDD that is failing is 
attempting to map the value V to the K, so that (V, K) is now going to be 
partitioned by the value (now the key) instead.  So I can try running some 
checks of multiplicity to see if my values have some kind of skew... 
unfortunately most of those checks are going to involve reduceByKey-like 
operations that will probably result in 2GB failures themselves... I was hoping 
to get the mapping and partitioning of (K,V) -> (V,K) accomplished first before 
running such checks.  Thanks for the suggestion, though!
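
A minimal sketch of the re-keying and repartitioning described above (the RDD 
name and partition count are illustrative):

{noformat}
import org.apache.spark.HashPartitioner

// Swap (K, V) to (V, K) and partition by the new key.
val swapped = kvRdd
  .map { case (k, v) => (v, k) }
  .partitionBy(new HashPartitioner(6800))
{noformat}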

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6235) Address various 2G limits

2015-10-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957546#comment-14957546
 ] 

Reynold Xin commented on SPARK-6235:


Is your data skewed? i.e. maybe there is a single key that's enormous?


> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11105) Disitribute the log4j.properties files from the client to the executors

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11105:


Assignee: (was: Apache Spark)

> Disitribute the log4j.properties files from the client to the executors
> ---
>
> Key: SPARK-11105
> URL: https://issues.apache.org/jira/browse/SPARK-11105
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Srinivasa Reddy Vundela
>Priority: Minor
>
> The log4j.properties file from the client is not distributed to the 
> executors. This means that the client settings are not applied to the 
> executors and they run with the default settings.
> This affects troubleshooting and data gathering.
> The workaround is to use the --files option for spark-submit to propagate the 
> log4j.properties file



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6235) Address various 2G limits

2015-10-14 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957540#comment-14957540
 ] 

Glenn Strycker commented on SPARK-6235:
---

Until this issue and sub-issue tickets are solved, are there any known 
work-arounds?  Increase number of partitions, or decrease?  Split up RDDs into 
parts, run your command, and then union?  Turn off Kryo?  Use dataframes?  
Help!!

I am encountering the 2GB bug on attempting to simply (re)partition by key an 
RDD of modest size (84GB) and low skew (AFAIK).  I have my memory requests per 
executor, per master node, per Java, etc. all cranked up as far as they'll go, 
and I'm currently attempting to partition this RDD across 6800 partitions.  
Unless my skew is really bad, I don't see why 12MB per partition would be 
causing a shuffle to hit the 2GB limit, unless the overhead of so many 
partitions is actually hurting rather than helping.  I'm going to try adjusting 
my partition number and see what happens, but I wanted to know if there is a 
standard work-around answer to this 2GB issue.

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11105) Disitribute the log4j.properties files from the client to the executors

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11105:


Assignee: Apache Spark

> Disitribute the log4j.properties files from the client to the executors
> ---
>
> Key: SPARK-11105
> URL: https://issues.apache.org/jira/browse/SPARK-11105
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Srinivasa Reddy Vundela
>Assignee: Apache Spark
>Priority: Minor
>
> The log4j.properties file from the client is not distributed to the 
> executors. This means that the client settings are not applied to the 
> executors and they run with the default settings.
> This affects troubleshooting and data gathering.
> The workaround is to use the --files option for spark-submit to propagate the 
> log4j.properties file



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11068) Add callback to query execution

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957541#comment-14957541
 ] 

Apache Spark commented on SPARK-11068:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9119

> Add callback to query execution
> ---
>
> Key: SPARK-11068
> URL: https://issues.apache.org/jira/browse/SPARK-11068
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11105) Disitribute the log4j.properties files from the client to the executors

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957539#comment-14957539
 ] 

Apache Spark commented on SPARK-11105:
--

User 'vundela' has created a pull request for this issue:
https://github.com/apache/spark/pull/9118

> Disitribute the log4j.properties files from the client to the executors
> ---
>
> Key: SPARK-11105
> URL: https://issues.apache.org/jira/browse/SPARK-11105
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Srinivasa Reddy Vundela
>Priority: Minor
>
> The log4j.properties file from the client is not distributed to the 
> executors. This means that the client settings are not applied to the 
> executors and they run with the default settings.
> This affects troubleshooting and data gathering.
> The workaround is to use the --files option for spark-submit to propagate the 
> log4j.properties file



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11109) move FsHistoryProvider off import org.apache.hadoop.fs.permission.AccessControlException

2015-10-14 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957451#comment-14957451
 ] 

Steve Loughran commented on SPARK-11109:


This is compatible with Hadoop branch-1; the new exception has existed for a 
long time.

> move FsHistoryProvider off import 
> org.apache.hadoop.fs.permission.AccessControlException
> 
>
> Key: SPARK-11109
> URL: https://issues.apache.org/jira/browse/SPARK-11109
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Steve Loughran
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> {{FsHistoryProvider}} imports and uses 
> {{org.apache.hadoop.fs.permission.AccessControlException}}; this has been 
> superseded by its subclass 
> {{org.apache.hadoop.security.AccessControlException}} since ~2011. Moving to 
> that subclass would remove a deprecation warning and ensure that, were the 
> Hadoop team to remove the old class (as HADOOP-11356 has already done on 
> trunk), everything would still compile and link.
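
A minimal sketch of what the switch looks like at a call site (the method and 
variable names are illustrative, not the actual FsHistoryProvider code):

{noformat}
import org.apache.hadoop.fs.{FileSystem, Path}
// Catch the newer subclass rather than the deprecated
// org.apache.hadoop.fs.permission.AccessControlException.
import org.apache.hadoop.security.AccessControlException

def openIfReadable(fs: FileSystem, path: Path) =
  try Some(fs.open(path))
  catch { case _: AccessControlException => None }  // not readable; skip it
{noformat}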



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11109) move FsHistoryProvider off import org.apache.hadoop.fs.permission.AccessControlException

2015-10-14 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-11109:
--

 Summary: move FsHistoryProvider off import 
org.apache.hadoop.fs.permission.AccessControlException
 Key: SPARK-11109
 URL: https://issues.apache.org/jira/browse/SPARK-11109
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Steve Loughran
Priority: Minor


{{FsHistoryProvider}} imports and uses 
{{org.apache.hadoop.fs.permission.AccessControlException}}; this has been 
superseded by its subclass 
{{org.apache.hadoop.security.AccessControlException}} since ~2011. Moving to 
that subclass would remove a deprecation warning and ensure that, were the 
Hadoop team to remove the old class (as HADOOP-11356 has already done on 
trunk), everything would still compile and link.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11078) Ensure spilling tests are actually spilling

2015-10-14 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reassigned SPARK-11078:
-

Assignee: Andrew Or

> Ensure spilling tests are actually spilling
> ---
>
> Key: SPARK-11078
> URL: https://issues.apache.org/jira/browse/SPARK-11078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> The new unified memory management model in SPARK-10983 uncovered many brittle 
> tests that rely on arbitrary thresholds to detect spilling. Some tests don't 
> even assert that spilling did occur.
> We should go through all the places where we test spilling behavior and 
> correct the tests, a subset of which are definitely incorrect. Potential 
> suspects:
> - UnsafeShuffleSuite
> - ExternalAppendOnlyMapSuite
> - ExternalSorterSuite
> - SQLQuerySuite
> - DistributedSuite
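
Following the list above, a rough sketch of the kind of assertion such tests 
could make, using task metrics gathered through a listener (a sketch only, 
assuming an existing SparkContext {{sc}}, not the actual test code):

{noformat}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

var bytesSpilled = 0L
sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) bytesSpilled += m.memoryBytesSpilled + m.diskBytesSpilled
  }
})
// ... run the workload that is expected to spill ...
assert(bytesSpilled > 0, "expected the job to spill, but it did not")
{noformat}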



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10943) NullType Column cannot be written to Parquet

2015-10-14 Thread Jason C Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason C Lee updated SPARK-10943:

Comment: was deleted

(was: I'd like to work on this. Thanx)

> NullType Column cannot be written to Parquet
> 
>
> Key: SPARK-10943
> URL: https://issues.apache.org/jira/browse/SPARK-10943
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jason Pohl
>
> var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null 
> as comments")
> //FAIL - Try writing a NullType column (where all the values are NULL)
> data02.write.parquet("/tmp/celtra-test/dataset2")
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 179.0 (TID 39924, 10.0.196.208): 
> org.apache.spark.sql.AnalysisException: Unsupported data type 
> StructField(comments,NullType,true).dataType;
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:92)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
>   at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetRelation.scala:94)
>   at 
> org.apache.spark.sql.execution.datasources.parquet

[jira] [Commented] (SPARK-10873) can't sort columns on history page

2015-10-14 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957375#comment-14957375
 ] 

Marcelo Vanzin commented on SPARK-10873:


What I mean is that while replacing the sorting library is sort of easy, by 
itself it doesn't really solve the problem.

Pagination is currently done in the backend, meaning the backend will generate 
hardcoded HTML with the current page, instead of something that can be easily 
consumed by a client-side library to do pagination and sorting on the client.

> can't sort columns on history page
> --
>
> Key: SPARK-10873
> URL: https://issues.apache.org/jira/browse/SPARK-10873
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> Starting with 1.5.1 the history server page isn't allowing sorting by column



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10873) can't sort columns on history page

2015-10-14 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957360#comment-14957360
 ] 

Thomas Graves commented on SPARK-10873:
---

[~vanzin]  I assume your comment about the backend needing to change is about 
fixing sorting and such with the existing implementation, not with DataTables?

DataTables generally sends all the data and then does the sorting, pagination, 
etc. on the client side, and in my experience on Hadoop it has been very 
performant. The biggest issue is transferring the data if there is a lot of 
it, but unless you go server side that is going to be an issue with anything.

I agree with you that sorting which doesn't span pages is confusing, which is 
why I was thinking something like DataTables, which already does this for us, 
would be easier.

> can't sort columns on history page
> --
>
> Key: SPARK-10873
> URL: https://issues.apache.org/jira/browse/SPARK-10873
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> Starting with 1.5.1 the history server page isn't allowing sorting by column



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10513) Springleaf Marketing Response

2015-10-14 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957345#comment-14957345
 ] 

Joseph K. Bradley commented on SPARK-10513:
---

[~yanboliang]  This is really helpful feedback.  Thanks very much for taking 
the time!  I'll try to list plans for addressing the various issues you found:

1. Here's the closest issue I could find for spark-csv: 
[https://github.com/databricks/spark-csv/issues/48]  Would you mind commenting 
there to try to escalate the issue?

2. What would be your ideal way to write this in the DataFrame API?  Something 
like 
{{train.withColumn(train("label").cast(DoubleType).as("label")).na.drop()}}?  
(I think that almost works now, but I'm not actually sure if the cast works or 
fails when it encounters empty Strings.)

3. Just made a JIRA: [SPARK-11108]

4. Do you mean a completely missing value?  Or do you mean that StringIndexer 
should handle an empty String differently?

5. Multi-value support for transformers: [SPARK-8418]

6. Here's some more detailed discussion which I just wrote down: [SPARK-11106]

I haven't yet looked at your example code, but will try to soon.  Thanks again 
for working on this!
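
For point 2 above, a sketch of what that could look like with the current 
DataFrame API (the {{train}} DataFrame and the "label" column name are 
illustrative):

{noformat}
import org.apache.spark.sql.types.DoubleType

// Cast the label column to Double and drop rows where the cast yields null.
val cleaned = train
  .withColumn("label", train("label").cast(DoubleType))
  .na.drop(Seq("label"))
{noformat}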

> Springleaf Marketing Response
> -
>
> Key: SPARK-10513
> URL: https://issues.apache.org/jira/browse/SPARK-10513
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Apply ML pipeline API to Springleaf Marketing Response 
> (https://www.kaggle.com/c/springleaf-marketing-response)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11108) OneHotEncoder should support other numeric input types

2015-10-14 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-11108:
-

 Summary: OneHotEncoder should support other numeric input types
 Key: SPARK-11108
 URL: https://issues.apache.org/jira/browse/SPARK-11108
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor


See parent JIRA for more info.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11108) OneHotEncoder should support other numeric input types

2015-10-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11108:
--
Description: 
See parent JIRA for more info.

Also see [SPARK-10513] for motivation behind issue.

  was:See parent JIRA for more info.


> OneHotEncoder should support other numeric input types
> --
>
> Key: SPARK-11108
> URL: https://issues.apache.org/jira/browse/SPARK-11108
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> See parent JIRA for more info.
> Also see [SPARK-10513] for motivation behind issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11040) SaslRpcHandler does not delegate all methods to underlying handler

2015-10-14 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11040.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 1.6.0

> SaslRpcHandler does not delegate all methods to underlying handler
> --
>
> Key: SPARK-11040
> URL: https://issues.apache.org/jira/browse/SPARK-11040
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.6.0
>
>
> {{SaslRpcHandler}} only delegates {{receive}} and {{getStreamManager}}, so 
> when SASL is enabled, other events will be missed by apps.
> This affects other version too, but I think these events aren't actually used 
> there. They'll be used by the new rpc backend in 1.6, though.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11107) spark.ml should support more input column types: umbrella

2015-10-14 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-11107:
-

 Summary: spark.ml should support more input column types: umbrella
 Key: SPARK-11107
 URL: https://issues.apache.org/jira/browse/SPARK-11107
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley


This is an umbrella for expanding the set of data types which spark.ml Pipeline 
stages can take.  This should not involve breaking APIs, but merely involve 
slight changes such as supporting all Numeric types instead of just Double.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11103) Filter applied on Merged Parquet shema with new column fail with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-14 Thread Dominic Ricard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957317#comment-14957317
 ] 

Dominic Ricard edited comment on SPARK-11103 at 10/14/15 5:24 PM:
--

Strangely enough, this works perfectly:
{noformat}
select 
  col1
from
  `table3`
where
  (CASE WHEN col2 = 2 THEN true ELSE false END) = true;
{noformat}

And returns only the row that contains col2 = 2


was (Author: dricard):
Strangely enough, this works perfectly:
{noformat}
select col1 from `table3` where CASE WHEN col2 = 2 THEN true ELSE false END = 
true;
{noformat}

And returns only the row that contains col2 = 2

> Filter applied on Merged Parquet shema with new column fail with 
> (java.lang.IllegalArgumentException: Column [column_name] was not found in 
> schema!)
> 
>
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dominic Ricard
>
> When evolving a schema in Parquet files, Spark properly exposes all columns 
> found across the different Parquet files, but when querying the data it is 
> not possible to apply a filter on a column that is not present in all files.
> To reproduce:
> *SQL:*
> {noformat}
> create table `table1` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
> create table `table2` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as 
> `col2`;
> create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path 
> "hdfs://:/path/to/table");
> select col1 from `table3` where col2 = 2;
> {noformat}
> The last select will output the following Stack Trace:
> {noformat}
> An error occurred when executing the SQL command:
> select col1 from `table3` where col2 = 2
> [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 
> 0, SQL state: TStatus(statusCode:ERROR_STATUS, 
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException:
>  Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, 
> most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 
> 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not 
> found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:155)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRD

[jira] [Updated] (SPARK-7425) spark.ml Predictor should support other numeric types for label

2015-10-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7425:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-11107

> spark.ml Predictor should support other numeric types for label
> ---
>
> Key: SPARK-7425
> URL: https://issues.apache.org/jira/browse/SPARK-7425
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>  Labels: starter
>
> Currently, the Predictor abstraction expects the input labelCol type to be 
> DoubleType, but we should support other numeric types.  This will involve 
> updating the PredictorParams.validateAndTransformSchema method.
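
A minimal sketch of the relaxed check this would imply (illustrative only, not 
the actual {{validateAndTransformSchema}} code):

{noformat}
import org.apache.spark.sql.types.{NumericType, StructType}

// Accept any numeric label column instead of requiring DoubleType exactly.
def checkLabelColumn(schema: StructType, labelCol: String): Unit = {
  val dt = schema(labelCol).dataType
  require(dt.isInstanceOf[NumericType],
    s"Label column $labelCol must be numeric, but was $dt")
}
{noformat}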



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11107) spark.ml should support more input column types: umbrella

2015-10-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11107:
--
Issue Type: Umbrella  (was: Improvement)

> spark.ml should support more input column types: umbrella
> -
>
> Key: SPARK-11107
> URL: https://issues.apache.org/jira/browse/SPARK-11107
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This is an umbrella for expanding the set of data types which spark.ml 
> Pipeline stages can take.  This should not involve breaking APIs, but merely 
> involve slight changes such as supporting all Numeric types instead of just 
> Double.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11103) Filter applied on Merged Parquet shema with new column fail with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-14 Thread Dominic Ricard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957317#comment-14957317
 ] 

Dominic Ricard commented on SPARK-11103:


Strangely enough, this works perfectly:
{noformat}
select col1 from `table3` where CASE WHEN col2 = 2 THEN true ELSE false END = 
true;
{noformat}

And returns only the row that contains col2 = 2

> Filter applied on Merged Parquet shema with new column fail with 
> (java.lang.IllegalArgumentException: Column [column_name] was not found in 
> schema!)
> 
>
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dominic Ricard
>
> When evolving a schema in Parquet files, Spark properly exposes all columns 
> found across the different Parquet files, but when querying the data it is 
> not possible to apply a filter on a column that is not present in all files.
> To reproduce:
> *SQL:*
> {noformat}
> create table `table1` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
> create table `table2` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as 
> `col2`;
> create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path 
> "hdfs://:/path/to/table");
> select col1 from `table3` where col2 = 2;
> {noformat}
> The last select will output the following Stack Trace:
> {noformat}
> An error occurred when executing the SQL command:
> select col1 from `table3` where col2 = 2
> [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 
> 0, SQL state: TStatus(statusCode:ERROR_STATUS, 
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException:
>  Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, 
> most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 
> 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not 
> found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:155)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrR

[jira] [Created] (SPARK-11106) Should ML Models contains single models or Pipelines?

2015-10-14 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-11106:
-

 Summary: Should ML Models contains single models or Pipelines?
 Key: SPARK-11106
 URL: https://issues.apache.org/jira/browse/SPARK-11106
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Joseph K. Bradley
Priority: Critical


This JIRA is for discussing whether ML Estimators should do feature 
processing.

h2. Issue

Currently, almost all ML Estimators require strict input types.  E.g., 
DecisionTreeClassifier requires that the label column be Double type and have 
metadata indicating the number of classes.

This requires users to know how to preprocess data.

h2. Ideal workflow

A user should be able to pass any reasonable data to a Transformer or Estimator 
and have it "do the right thing."

E.g.:
* If DecisionTreeClassifier is given a String column for labels, it should know 
to index the Strings.
* See [SPARK-10513] for a similar issue with OneHotEncoder.

h2. Possible solutions

There are a few solutions I have thought of.  Please comment with feedback or 
alternative ideas!

h3. Leave as is

Pro: The current setup is good in that it forces the user to be very aware of 
what they are doing.  Feature transformations will not happen silently.

Con: The user has to write boilerplate code for transformations.  The API is 
not what some users would expect; e.g., coming from R, a user might expect some 
automatic transformations.

h3. All Transformers can contain PipelineModels

We could allow all Transformers and Models to contain arbitrary PipelineModels. 
 E.g., if a DecisionTreeClassifier were given a String label column, it might 
return a Model which contains a simple fitted PipelineModel containing 
StringIndexer + DecisionTreeClassificationModel.

The API could present this to the user, or it could be hidden from the user.  
Ideally, it would be hidden from the beginner user, but accessible for experts.

The main problem is that we might have to break APIs.  E.g., OneHotEncoder may 
need to do indexing if given a String input column.  This means it should no 
longer be a Transformer; it should be an Estimator.

h3. All Estimators should use RFormula

The best option I have thought of is to make RFormula be the primary method for 
automatic feature transformation.  We could start adding an RFormula Param to 
all Estimators, and it could handle most of these feature transformation issues.

We could maintain old APIs:
* If a user sets the input column names, then those can be used in the 
traditional (no automatic transformation) way.
* If a user sets the RFormula Param, then it can be used instead.  (This should 
probably take precedence over the old API.)
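
For concreteness, a sketch of the existing {{RFormula}} stage that this 
proposal would lean on (assumes a DataFrame {{df}}; column names are 
illustrative):

{noformat}
import org.apache.spark.ml.feature.RFormula

// RFormula indexes string columns and assembles features automatically,
// which is the kind of preprocessing discussed above.
val formula = new RFormula()
  .setFormula("clicked ~ country + hour")
  .setFeaturesCol("features")
  .setLabelCol("label")

val prepared = formula.fit(df).transform(df)
{noformat}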



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-10-14 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957301#comment-14957301
 ] 

Zhan Zhang commented on SPARK-11087:


I will take a look at this one.

> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external Hive table stored as a partitioned ORC file (see the 
> table schema below). I tried to query the table with a where clause:
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 117")
> But in the log file, with debug logging enabled, the ORC pushdown predicate 
> was not generated.
> Unfortunately my table was not sorted when I inserted the data, but I 
> expected the ORC pushdown predicate to be generated anyway (because of the 
> where clause).
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_name              data_type           comment
>
> date  int 
> hh                      int
> x int 
> y int 
> heightfloat   
> u float   
> v float   
> w float   
> phfloat   
> phb   float   
> t float   
> p float   
> pbfloat   
> qvaporfloat   
> qgraupfloat   
> qnice float   
> qnrainfloat   
> tke_pbl   float   
> el_pblfloat   
> qcloudfloat   
>
> # Partition Information
> # col_name              data_type           comment
>
> zone  int 
> z int 
> year  int 
> month int 
>
> # Detailed Table Information   
> Database: default  
> Owner:                patcharee
> CreateTime:   Thu Jul 09 16:46:54 CEST 2015
> LastAccessTime:   UNKNOWN  
> Protect Mode: None 
> Retention:0
> Location: hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D   
>  
> Table Type:   EXTERNAL_TABLE   
> Table Parameters:  
>   EXTERNAL                TRUE
>   comment                 this table is imported from rwf_data/*/wrf/*
>   last_modified_by        patcharee
>   last_modified_time      1439806692
>   orc.compress            ZLIB
>   transient_lastDdlTime   1439806692  
>
> # Storage Information  
> SerDe Library:org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat:  org.apache.hadoop.hive.ql.io.orc.OrcInputFormat  
> OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
>  
> Compressed:   No   
> Num Buckets:  -1   
> Bucket Columns:   []   
> Sort Columns: []   
> Storage Desc Params:   
>   serialization.format    1
> Time taken: 0.388 seconds, Fetched: 58 row(s)
> 
> Data was inserted into this table by another spark job>
> df.write.format("org.apache.spark.sq

[jira] [Resolved] (SPARK-10619) Can't sort columns on Executor Page

2015-10-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10619.
-
   Resolution: Fixed
Fix Version/s: 1.6.0
   1.5.2

> Can't sort columns on Executor Page
> ---
>
> Key: SPARK-10619
> URL: https://issues.apache.org/jira/browse/SPARK-10619
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Fix For: 1.5.2, 1.6.0
>
>
> I am using Spark 1.5 running on YARN. When I go to the executors page, it 
> won't allow sorting of the columns. This used to work in Spark 1.4.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11083) insert overwrite table failed when beeline reconnect

2015-10-14 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957281#comment-14957281
 ] 

Davies Liu commented on SPARK-11083:


Maybe this one: https://github.com/apache/spark/pull/8909; it uses a separate 
session for each connection.

> insert overwrite table failed when beeline reconnect
> 
>
> Key: SPARK-11083
> URL: https://issues.apache.org/jira/browse/SPARK-11083
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Spark: master branch
> Hadoop: 2.7.1
> JDK: 1.8.0_60
>Reporter: Weizhong
> Fix For: 1.6.0
>
>
> 1. Start the Thrift server
> 2. Use beeline to connect to the Thrift server, then execute an "insert 
> overwrite table_name ..." clause -- success
> 3. Exit beeline
> 4. Reconnect to the Thrift server, and then execute an "insert overwrite 
> table_name ..." clause -- failed
> {noformat}
> 15/10/13 18:44:35 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.sql.hive.client.Shim_v1_2.loadDynamicPartitions(HiveShim.scala:520)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.loadDynamicPartitions(ClientWrapper.scala:505)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:225)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:58)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:144)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:129)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:739)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:224)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://9.91.8.214:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_returns/.hive-staging_hive_2015-10-13_18-44-17_606_2400736035447406540-2/-ext-1/cr_returned_date=2003-08-27/part-00048
>  to destinat

[jira] [Resolved] (SPARK-11083) insert overwrite table failed when beeline reconnect

2015-10-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-11083.

   Resolution: Fixed
 Assignee: Davies Liu
Fix Version/s: 1.6.0

> insert overwrite table failed when beeline reconnect
> 
>
> Key: SPARK-11083
> URL: https://issues.apache.org/jira/browse/SPARK-11083
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Spark: master branch
> Hadoop: 2.7.1
> JDK: 1.8.0_60
>Reporter: Weizhong
>Assignee: Davies Liu
> Fix For: 1.6.0
>
>
> 1. Start the Thrift server
> 2. Use beeline to connect to the Thrift server, then execute an "insert 
> overwrite table_name ..." clause -- success
> 3. Exit beeline
> 4. Reconnect to the Thrift server, and then execute an "insert overwrite 
> table_name ..." clause -- failed
> {noformat}
> 15/10/13 18:44:35 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.sql.hive.client.Shim_v1_2.loadDynamicPartitions(HiveShim.scala:520)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.loadDynamicPartitions(ClientWrapper.scala:505)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:225)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:58)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:144)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:129)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:739)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:224)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://9.91.8.214:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_returns/.hive-staging_hive_2015-10-13_18-44-17_606_2400736035447406540-2/-ext-1/cr_returned_date=2003-08-27/part-00048
>  to destination 
> hdfs://9.91.8.214:9000/user/hive/warehouse/

[jira] [Updated] (SPARK-11105) Disitribute the log4j.properties files from the client to the executors

2015-10-14 Thread Srinivasa Reddy Vundela (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinivasa Reddy Vundela updated SPARK-11105:

Summary: Disitribute the log4j.properties files from the client to the 
executors  (was: Dsitribute the log4j.properties files from the client to the 
executors)

> Disitribute the log4j.properties files from the client to the executors
> ---
>
> Key: SPARK-11105
> URL: https://issues.apache.org/jira/browse/SPARK-11105
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Srinivasa Reddy Vundela
>Priority: Minor
>
> The log4j.properties file from the client is not distributed to the 
> executors. This means that the client settings are not applied to the 
> executors and they run with the default settings.
> This affects troubleshooting and data gathering.
> The workaround is to use the --files option for spark-submit to propagate the 
> log4j.properties file



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11105) Dsitribute the log4j.properties files from the client to the executors

2015-10-14 Thread Srinivasa Reddy Vundela (JIRA)
Srinivasa Reddy Vundela created SPARK-11105:
---

 Summary: Dsitribute the log4j.properties files from the client to 
the executors
 Key: SPARK-11105
 URL: https://issues.apache.org/jira/browse/SPARK-11105
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.5.1
Reporter: Srinivasa Reddy Vundela
Priority: Minor


The log4j.properties file from the client is not distributed to the executors. 
This means that the client settings are not applied to the executors and they 
run with the default settings.
This affects troubleshooting and data gathering.
The workaround is to use the --files option for spark-submit to propagate the 
log4j.properties file



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-11099:
---
Affects Version/s: (was: 1.5.1)

(Removing affected version. This code does not exist in branch-1.5.)

> Default conf property file is not loaded 
> -
>
> Key: SPARK-11099
> URL: https://issues.apache.org/jira/browse/SPARK-11099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Reporter: Jeff Zhang
>Priority: Critical
>
> spark.driver.extraClassPath doesn't take effect in the latest code, and find 
> the root cause is due to the default conf property file is not loaded 
> The bug is caused by this code snippet in AbstractCommandBuilder
> {code}
>   Map<String, String> getEffectiveConfig() throws IOException {
> if (effectiveConfig == null) {
>   if (propertiesFile == null) {
> effectiveConfig = conf;   // return from here if no propertyFile 
> is provided
>   } else {
> effectiveConfig = new HashMap<>(conf);
> Properties p = loadPropertiesFile();// default propertyFile 
> will load here
> for (String key : p.stringPropertyNames()) {
>   if (!effectiveConfig.containsKey(key)) {
> effectiveConfig.put(key, p.getProperty(key));
>   }
> }
>   }
> }
> return effectiveConfig;
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11098) RPC message ordering is not guaranteed

2015-10-14 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957232#comment-14957232
 ] 

Marcelo Vanzin commented on SPARK-11098:


I'm not explicitly working on this at the moment.

> RPC message ordering is not guaranteed
> --
>
> Key: SPARK-11098
> URL: https://issues.apache.org/jira/browse/SPARK-11098
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>
> NettyRpcEnv doesn't guarantee message delivery order since there are multiple 
> threads sending messages in clientConnectionExecutor thread pool. We should 
> fix that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11097) Add connection established callback to lower level RPC layer so we don't need to check for new connections in NettyRpcHandler.receive

2015-10-14 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957230#comment-14957230
 ] 

Marcelo Vanzin commented on SPARK-11097:


Hi [~rxin], can you explain what would be the use case for this? Is it just to 
simplify the code?

I'm working on SPARK-10997 and I have changed code around that area a lot. I 
was able to simplify the code without the need for a connection established 
callback.

> Add connection established callback to lower level RPC layer so we don't need 
> to check for new connections in NettyRpcHandler.receive
> -
>
> Key: SPARK-11097
> URL: https://issues.apache.org/jira/browse/SPARK-11097
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>
> I think we can remove the check for new connections in 
> NettyRpcHandler.receive if we just add a channel registered callback to the 
> lower level network module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10873) can't sort columns on history page

2015-10-14 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957210#comment-14957210
 ] 

Marcelo Vanzin commented on SPARK-10873:


As part of trying to fix SPARK-10172 I played with jQuery DataTables, and it 
works fine even with rowspan. But I thought it would be too big a change for 
the 1.5 branch.

Also, I still believe that sorting with the current pagination code is 
confusing and not very helpful. To enable proper sorting / searching, the 
backend would need to be changed to support something more dynamic, so that the 
client can make the decision about what to show and how.

> can't sort columns on history page
> --
>
> Key: SPARK-10873
> URL: https://issues.apache.org/jira/browse/SPARK-10873
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> Starting with 1.5.1 the history server page isn't allowing sorting by column



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning

2015-10-14 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957194#comment-14957194
 ] 

Cheolsoo Park commented on SPARK-6910:
--

You're right that the 2nd query is faster because the table/partition metadata is 
cached. In particular, when {{spark.sql.hive.metastorePartitionPruning}} is false 
(the default), Spark caches the metadata for all partitions, so any query against 
the same table runs faster even with a different predicate. See the relevant code 
[here|https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L830-L839].
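
For reference, a minimal sketch of flipping this setting from a HiveContext (a 
hedged example assuming Spark 1.5, an existing SparkContext {{sc}}, and placeholder 
table/column names):

{code}
import org.apache.spark.sql.hive.HiveContext

// With pruning enabled, only the partitions matching the predicate are fetched
// from the metastore instead of caching metadata for all of them.
val sqlContext = new HiveContext(sc)
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.sql("SELECT * FROM some_partitioned_table WHERE part_col = '2015-10-14'").show()
{code}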

> Support for pushing predicates down to metastore for partition pruning
> --
>
> Key: SPARK-6910
> URL: https://issues.apache.org/jira/browse/SPARK-6910
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheolsoo Park
>Priority: Critical
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10925) Exception when joining DataFrames

2015-10-14 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957170#comment-14957170
 ] 

Xiao Li commented on SPARK-10925:
-

Hi, Alexis, 

The schema of your query result has duplicate column names. 

In your test case, you just need to change one line:

val cardinalityDF2 = df4.groupBy("surname")
  .agg(count("surname").as("cardinality_surname"))

-->

val cardinalityDF2 = df4.groupBy("surname")
  .agg(count("surname").as("cardinality_surname"))
  .withColumnRenamed("surname", "surname_new")
cardinalityDF2.show()

I think Spark SQL should detect this problem at an earlier stage. I will try to 
fix it and output a clearer error message. 

Let me know if you have more questions. Thanks! 

Xiao Li


> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.Data

[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-14 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957144#comment-14957144
 ] 

Sandy Ryza commented on SPARK-:
---

Maybe you all have thought through this as well, but I had some more thoughts 
on the proposed API.

Fundamentally, it seems weird to me that the user is responsible for having a 
matching Encoder around every time they want to map to a class of a particular 
type.  In 99% of cases, the Encoder used to encode any given type will be the 
same, and it seems more intuitive to me to specify this up front.

To be more concrete, suppose I want to use case classes in my app and have a 
function that can auto-generate an Encoder from a class object (though this 
might be a little bit time consuming because it needs to use reflection).  With 
the current proposal, any time I want to map my Dataset to a Dataset of some 
case class, I need to either have a line of code that generates an Encoder for 
that case class, or have an Encoder already lying around.  If I perform this 
operation within a method, I need to pass the Encoder down to the method and 
include it in the signature.

Ideally I would be able to register an EncoderSystem up front that caches 
Encoders and generates new Encoders whenever it sees a new class used.  This 
still of course requires the user to pass in type information when they call 
map, but it's easier for them to get this information than an actual encoder.  
If there's not some principled way to get this working implicitly with 
ClassTags, the user could just pass in classOf[MyCaseClass] as the second 
argument to map.
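
To make the suggestion concrete, here is a hypothetical sketch of the kind of 
registry described above; {{Encoder}} here is a stand-in trait, and 
{{EncoderRegistry}}/{{encoderFor}} are illustrative names, not part of Spark or of 
the proposed API:

{code}
import scala.collection.concurrent.TrieMap
import scala.reflect.ClassTag

// Stand-in for whatever encoder abstraction the Dataset API exposes.
trait Encoder[T]

object EncoderRegistry {
  // Factory registered once, up front (e.g. a reflection-based generator for case classes).
  @volatile private var factory: Class[_] => Encoder[_] = null
  private val cache = TrieMap.empty[Class[_], Encoder[_]]

  def init(f: Class[_] => Encoder[_]): Unit = { factory = f }

  // Looked up (and cached) by runtime class, so call sites only need a ClassTag,
  // or could pass classOf[MyCaseClass] explicitly.
  def encoderFor[T](implicit ct: ClassTag[T]): Encoder[T] =
    cache.getOrElseUpdate(ct.runtimeClass, factory(ct.runtimeClass)).asInstanceOf[Encoder[T]]
}
{code}

The point is just that, after a one-time init, call sites could obtain encoders 
from type information alone rather than threading Encoder parameters through 
method signatures.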

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2533) Show summary of locality level of completed tasks in the each stage page of web UI

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957131#comment-14957131
 ] 

Apache Spark commented on SPARK-2533:
-

User 'jbonofre' has created a pull request for this issue:
https://github.com/apache/spark/pull/9117

> Show summary of locality level of completed tasks in the each stage page of 
> web UI
> --
>
> Key: SPARK-2533
> URL: https://issues.apache.org/jira/browse/SPARK-2533
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Masayoshi TSUZUKI
>Priority: Minor
>
> When the number of tasks is very large, it is impossible to know how many 
> tasks were executed under (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL) from the 
> stage page of web UI. It would be better to show the summary of task locality 
> level in web UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2533) Show summary of locality level of completed tasks in the each stage page of web UI

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2533:
---

Assignee: Apache Spark

> Show summary of locality level of completed tasks in the each stage page of 
> web UI
> --
>
> Key: SPARK-2533
> URL: https://issues.apache.org/jira/browse/SPARK-2533
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Masayoshi TSUZUKI
>Assignee: Apache Spark
>Priority: Minor
>
> When the number of tasks is very large, it is impossible to know how many 
> tasks were executed under (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL) from the 
> stage page of web UI. It would be better to show the summary of task locality 
> level in web UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2533) Show summary of locality level of completed tasks in the each stage page of web UI

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2533:
---

Assignee: (was: Apache Spark)

> Show summary of locality level of completed tasks in the each stage page of 
> web UI
> --
>
> Key: SPARK-2533
> URL: https://issues.apache.org/jira/browse/SPARK-2533
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Masayoshi TSUZUKI
>Priority: Minor
>
> When the number of tasks is very large, it is impossible to know how many 
> tasks were executed under (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL) from the 
> stage page of web UI. It would be better to show the summary of task locality 
> level in web UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10876) display total application time in spark history UI

2015-10-14 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-10876:
--
Assignee: Jean-Baptiste Onofré

> display total application time in spark history UI
> --
>
> Key: SPARK-10876
> URL: https://issues.apache.org/jira/browse/SPARK-10876
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Jean-Baptiste Onofré
>
> The history file has an application start and application end events.  It 
> would be nice if we could use these to display the total run time for the 
> application in the history UI.
> Could be displayed similar to "Total Uptime" for a running application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10873) can't sort columns on history page

2015-10-14 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957080#comment-14957080
 ] 

Thomas Graves commented on SPARK-10873:
---

Has anyone discussed just using something like jQuery DataTables or similar, 
which automatically gives us search 
(https://issues.apache.org/jira/browse/SPARK-10874), sort, pagination, etc.?
I'm not sure how well the row spanning works with DataTables, but it seems 
possible: http://www.datatables.net/examples/advanced_init/row_grouping.html

I know there are others like jqGrid, but I'm by no means a UI expert and have 
used DataTables some in Hadoop.

[~rxin] [~zsxwing], thoughts on using something like jQuery DataTables?

What about just using it on certain pages like the history page first? The 
downside is that pages might look different. As more and more people use Spark, 
being able to use the history page to debug is becoming a bigger and bigger issue.

> can't sort columns on history page
> --
>
> Key: SPARK-10873
> URL: https://issues.apache.org/jira/browse/SPARK-10873
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> Starting with 1.5.1 the history server page isn't allowing sorting by column



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11102) Unreadable exception when specifing non-exist input for JSON data source

2015-10-14 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-11102:
---
Summary: Unreadable exception when specifing non-exist input for JSON data 
source  (was: Not readable exception when specifing non-exist input for JSON 
data source)

> Unreadable exception when specifing non-exist input for JSON data source
> 
>
> Key: SPARK-11102
> URL: https://issues.apache.org/jira/browse/SPARK-11102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> If I specify a non-existent input path for the JSON data source, the following 
> exception will be thrown, and it is not readable. 
> {code}
> 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 19.9 KB, free 251.4 KB)
> 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB)
> 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at 
> :19
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
>   at $iwC$$iwC$$iwC.<init>(<console>:32)
>   at $iwC$$iwC.<init>(<console>:34)
>   at $iwC.<init>(<console>:36)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11103) Filter applied on Merged Parquet shema with new column fail with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956988#comment-14956988
 ] 

Hyukjin Kwon commented on SPARK-11103:
--

I tested this case. The problem is that Parquet filters are pushed down regardless 
of the schema of each split (or rather, each file).

Should predicate pushdown be prevented when the mergeSchema option is used?
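
Until that is sorted out, one possible workaround (an untested sketch, assuming 
Spark 1.5, a {{sqlContext}} as in spark-shell, and a placeholder HDFS path) is to 
disable Parquet filter pushdown when reading a merged schema, so the predicate is 
evaluated by Spark after the scan instead of inside the Parquet reader:

{code}
// Read with schema merging, but keep the predicate out of the Parquet reader.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")

val df = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("hdfs://<host>:<port>/path/to/table")

// col2 only exists in some of the files; without pushdown the filter is applied
// by Spark, so rows from files lacking the column get nulls and fail the predicate.
df.filter(df("col2") === 2).select("col1").show()
{code}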

> Filter applied on Merged Parquet shema with new column fail with 
> (java.lang.IllegalArgumentException: Column [column_name] was not found in 
> schema!)
> 
>
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dominic Ricard
>
> When evolving a schema in Parquet files, Spark properly exposes all columns 
> found in the different Parquet files, but when trying to query the data, it is 
> not possible to apply a filter on a column that is not present in all files.
> To reproduce:
> *SQL:*
> {noformat}
> create table `table1` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
> create table `table2` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as 
> `col2`;
> create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path 
> "hdfs://:/path/to/table");
> select col1 from `table3` where col2 = 2;
> {noformat}
> The last select will output the following Stack Trace:
> {noformat}
> An error occurred when executing the SQL command:
> select col1 from `table3` where col2 = 2
> [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 
> 0, SQL state: TStatus(statusCode:ERROR_STATUS, 
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException:
>  Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, 
> most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 
> 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not 
> found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:155)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.compute

[jira] [Assigned] (SPARK-11104) A potential deadlock in StreamingContext.stop and stopOnShutdown

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11104:


Assignee: (was: Apache Spark)

> A potential deadlock in StreamingContext.stop and stopOnShutdown
> 
>
> Key: SPARK-11104
> URL: https://issues.apache.org/jira/browse/SPARK-11104
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>
> When the shutdown hook of StreamingContext and StreamingContext.stop are 
> running at the same time (e.g., press CTRL-C when StreamingContext.stop is 
> running), the following deadlock may happen:
> {code}
> Java stack information for the threads listed above:
> ===
> "Thread-2":
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:699)
>   - waiting to lock <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:729)
>   at 
> org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:625)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:266)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1697)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:236)
>   - locked <0x0005405b6a00> (a 
> org.apache.spark.util.SparkShutdownHookManager)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> "main":
>   at 
> org.apache.spark.util.SparkShutdownHookManager.remove(ShutdownHookManager.scala:248)
>   - waiting to lock <0x0005405b6a00> (a 
> org.apache.spark.util.SparkShutdownHookManager)
>   at 
> org.apache.spark.util.ShutdownHookManager$.removeShutdownHook(ShutdownHookManager.scala:199)
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:712)
>   - locked <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:684)
>   - locked <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.SessionByKeyBenchmark$.main(SessionByKeyBenchmark.scala:108)
>   at 
> org.apache.spark.streaming.SessionByKeyBenchmark.main(SessionByKeyBenchmark.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:680)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11104) A potential deadlock in StreamingContext.stop and stopOnShutdown

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11104:


Assignee: Apache Spark

> A potential deadlock in StreamingContext.stop and stopOnShutdown
> 
>
> Key: SPARK-11104
> URL: https://issues.apache.org/jira/browse/SPARK-11104
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> When the shutdown hook of StreamingContext and StreamingContext.stop are 
> running at the same time (e.g., press CTRL-C when StreamingContext.stop is 
> running), the following deadlock may happen:
> {code}
> Java stack information for the threads listed above:
> ===
> "Thread-2":
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:699)
>   - waiting to lock <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:729)
>   at 
> org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:625)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:266)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1697)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:236)
>   - locked <0x0005405b6a00> (a 
> org.apache.spark.util.SparkShutdownHookManager)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> "main":
>   at 
> org.apache.spark.util.SparkShutdownHookManager.remove(ShutdownHookManager.scala:248)
>   - waiting to lock <0x0005405b6a00> (a 
> org.apache.spark.util.SparkShutdownHookManager)
>   at 
> org.apache.spark.util.ShutdownHookManager$.removeShutdownHook(ShutdownHookManager.scala:199)
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:712)
>   - locked <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:684)
>   - locked <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.SessionByKeyBenchmark$.main(SessionByKeyBenchmark.scala:108)
>   at 
> org.apache.spark.streaming.SessionByKeyBenchmark.main(SessionByKeyBenchmark.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:680)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11104) A potential deadlock in StreamingContext.stop and stopOnShutdown

2015-10-14 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-11104:


 Summary: A potential deadlock in StreamingContext.stop and 
stopOnShutdown
 Key: SPARK-11104
 URL: https://issues.apache.org/jira/browse/SPARK-11104
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Shixiong Zhu


When the shutdown hook of StreamingContext and StreamingContext.stop are 
running at the same time (e.g., press CTRL-C when StreamingContext.stop is 
running), the following deadlock may happen:

{code}
Java stack information for the threads listed above:
===
"Thread-2":
at 
org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:699)
- waiting to lock <0x0005405a1680> (a 
org.apache.spark.streaming.StreamingContext)
at 
org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:729)
at 
org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:625)
at 
org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:266)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:236)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1697)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:236)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
at scala.util.Try$.apply(Try.scala:161)
at 
org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:236)
- locked <0x0005405b6a00> (a 
org.apache.spark.util.SparkShutdownHookManager)
at 
org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
"main":
at 
org.apache.spark.util.SparkShutdownHookManager.remove(ShutdownHookManager.scala:248)
- waiting to lock <0x0005405b6a00> (a 
org.apache.spark.util.SparkShutdownHookManager)
at 
org.apache.spark.util.ShutdownHookManager$.removeShutdownHook(ShutdownHookManager.scala:199)
at 
org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:712)
- locked <0x0005405a1680> (a 
org.apache.spark.streaming.StreamingContext)
at 
org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:684)
- locked <0x0005405a1680> (a 
org.apache.spark.streaming.StreamingContext)
at 
org.apache.spark.streaming.SessionByKeyBenchmark$.main(SessionByKeyBenchmark.scala:108)
at 
org.apache.spark.streaming.SessionByKeyBenchmark.main(SessionByKeyBenchmark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:680)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}
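
The two stacks above acquire the same pair of locks in opposite orders. A minimal, 
self-contained sketch of that hazard (illustrative names only, not the actual 
Spark classes or the eventual fix):

{code}
// Thread A (shutdown hook): locks HookManager, then tries to lock Context via ctx.stop().
// Thread B (main):          locks Context, then tries to lock HookManager via remove().
// If both are in flight at once, each waits on the lock the other holds: a deadlock.
class Context {
  def stop(hooks: HookManager): Unit = synchronized {
    hooks.remove()          // Context -> HookManager
  }
}

class HookManager {
  def runAll(ctx: Context): Unit = synchronized {
    ctx.stop(this)          // HookManager -> Context
  }
  def remove(): Unit = synchronized { /* unregister the hook */ }
}
{code}

A fix would need to ensure the hook-manager lock is never requested while the 
context lock is held (for example, by removing the hook outside the synchronized 
section), so the two locks are always taken in a single consistent order.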



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11104) A potential deadlock in StreamingContext.stop and stopOnShutdown

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956974#comment-14956974
 ] 

Apache Spark commented on SPARK-11104:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/9116

> A potential deadlock in StreamingContext.stop and stopOnShutdown
> 
>
> Key: SPARK-11104
> URL: https://issues.apache.org/jira/browse/SPARK-11104
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>
> When the shutdown hook of StreamingContext and StreamingContext.stop are 
> running at the same time (e.g., press CTRL-C when StreamingContext.stop is 
> running), the following deadlock may happen:
> {code}
> Java stack information for the threads listed above:
> ===
> "Thread-2":
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:699)
>   - waiting to lock <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:729)
>   at 
> org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:625)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:266)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1697)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:236)
>   - locked <0x0005405b6a00> (a 
> org.apache.spark.util.SparkShutdownHookManager)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> "main":
>   at 
> org.apache.spark.util.SparkShutdownHookManager.remove(ShutdownHookManager.scala:248)
>   - waiting to lock <0x0005405b6a00> (a 
> org.apache.spark.util.SparkShutdownHookManager)
>   at 
> org.apache.spark.util.ShutdownHookManager$.removeShutdownHook(ShutdownHookManager.scala:199)
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:712)
>   - locked <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:684)
>   - locked <0x0005405a1680> (a 
> org.apache.spark.streaming.StreamingContext)
>   at 
> org.apache.spark.streaming.SessionByKeyBenchmark$.main(SessionByKeyBenchmark.scala:108)
>   at 
> org.apache.spark.streaming.SessionByKeyBenchmark.main(SessionByKeyBenchmark.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:680)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11103) Filter applied on Merged Parquet shema with new column fail with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-14 Thread Dominic Ricard (JIRA)
Dominic Ricard created SPARK-11103:
--

 Summary: Filter applied on Merged Parquet shema with new column 
fail with (java.lang.IllegalArgumentException: Column [column_name] was not 
found in schema!)
 Key: SPARK-11103
 URL: https://issues.apache.org/jira/browse/SPARK-11103
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Dominic Ricard


When evolving a schema in Parquet files, Spark properly exposes all columns 
found in the different Parquet files, but when trying to query the data, it is 
not possible to apply a filter on a column that is not present in all files.

To reproduce:
*SQL:*
{noformat}
create table `table1` STORED AS PARQUET LOCATION 
'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
create table `table2` STORED AS PARQUET LOCATION 
'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as `col2`;
create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path 
"hdfs://:/path/to/table");
select col1 from `table3` where col2 = 2;
{noformat}

The last select will output the following Stack Trace:
{noformat}
An error occurred when executing the SQL command:
select col1 from `table3` where col2 = 2

[Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 
0, SQL state: TStatus(statusCode:ERROR_STATUS, 
infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException:
 Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, most 
recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 208.92.52.88): 
java.lang.IllegalArgumentException: Column [col2] was not found in schema!
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
at 
org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
at 
org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
at 
org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
at 
org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
at 
org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
at 
org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160)
at 
org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:155)
at 
org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace::26:25, 
org.apache.spark.sql.hive.thriftserver.S
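
A hedged workaround sketch until the underlying issue is fixed: disabling 
Parquet filter push-down keeps the predicate on col2 out of the row-group 
filter, so files that lack the column are no longer rejected by 
SchemaCompatibilityValidator; the filter is then applied by Spark after the 
scan, at some performance cost.
{code}
// Workaround sketch, assuming the tables from the repro above.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
sqlContext.sql("select col1 from `table3` where col2 = 2").show()
{code}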

[jira] [Commented] (SPARK-10513) Springleaf Marketing Response

2015-10-14 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956832#comment-14956832
 ] 

Yanbo Liang commented on SPARK-10513:
-

[~mengxr]  A simple model training example has been posted at 
[here|https://github.com/yanboliang/Springleaf/blob/master/src/main/scala/com/ybliang/kaggle/Springleaf.scala].
 Although the code may look naive in places, that is because I deliberately 
restrict myself to the components provided by Spark DataFrame and ML/MLlib; I 
will keep updating the snippet as the issues found in this example are 
resolved.

I found the Springleaf competition more difficult than the San Francisco crime 
classification one because of the very large number of columns (1934), the lack 
of semantic knowledge about the columns, and the missing and erroneous data, 
but that also makes it more useful for testing the ML pipeline.

In the general case we should start with a GBT model when we have little 
knowledge of the data, but since SPARK-10055 already has an example using a 
Decision Tree model, I chose to illustrate the usage of logistic regression and 
trained an LR model instead.

I list the issues and requirements that I have found during the model training 
and prediction process:

1, Although spark-csv has an option to infer the schema, it can only identify 
DoubleType and StringType. For example, a timestamp column (which should be 
loaded as TimestampType) is identified as StringType by mistake, which leads 
OneHotEncoder to produce a massive number of encoded features.
2, Missing and erroneous data. I think DataFrame should provide a method to 
replace null or “” values with a user-specified value, just like 
“train[is.na(train)] <- -1” in R. It would also be better if we provided 
methods to remove illegal values. DataFrame has the ".na.drop()" method, but 
that is not enough.
3, OneHotEncoder can currently only be fitted on a DoubleType column. I think 
we can extend it to also support other NumericTypes such as IntegerType.
4, OneHotEncoder should consider a better way to handle “” values.
5, We often need OneHotEncoder to encode multiple columns at the same time, but 
ML does not provide this ability AFAIK.
6, StringIndexer and OneHotEncoder are often used together; should we provide a 
binding of these two feature transformers? (A minimal sketch of the current 
manual chaining appears after this comment.)

Looking forward to your comments on the points I found during this coding 
exercise. I can submit patches to fix these issues if they are within the scope 
of the Spark roadmap, and after that I will update the code of this example.
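
For points 2 and 6, here is a minimal sketch of the manual chaining that is 
needed today. It assumes a DataFrame df with a string column "v1"; the column 
names and the fill value are illustrative, not taken from the example code.
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Point 2: nulls can be filled, but "" values still need separate handling.
val filled = df.na.fill("missing", Seq("v1"))

// Point 6: StringIndexer and OneHotEncoder have to be wired up by hand,
// one column at a time.
val indexer = new StringIndexer().setInputCol("v1").setOutputCol("v1_idx")
val encoder = new OneHotEncoder().setInputCol("v1_idx").setOutputCol("v1_vec")

val model = new Pipeline().setStages(Array(indexer, encoder)).fit(filled)
val encoded = model.transform(filled)
{code}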

> Springleaf Marketing Response
> -
>
> Key: SPARK-10513
> URL: https://issues.apache.org/jira/browse/SPARK-10513
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Apply ML pipeline API to Springleaf Marketing Response 
> (https://www.kaggle.com/c/springleaf-marketing-response)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11027) Better group distinct columns in query compilation

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11027:


Assignee: Apache Spark

> Better group distinct columns in query compilation
> --
>
> Key: SPARK-11027
> URL: https://issues.apache.org/jira/browse/SPARK-11027
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> In AggregationQuerySuite, we have a test
> {code}
> checkAnswer(
>   sqlContext.sql(
> """
>   |SELECT sum(distinct value1), kEY - 100, count(distinct value1)
>   |FROM agg2
>   |GROUP BY Key - 100
> """.stripMargin),
>   Row(40, -99, 2) :: Row(0, -98, 2) :: Row(null, -97, 0) :: Row(30, null, 
> 3) :: Nil)
> {code}
> We will treat it as having two distinct columns because sum causes a cast on 
> value1. Maybe we can ignore the cast when we group distinct columns. So, it 
> will not be treated as having two distinct columns.
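
A hedged sketch of the idea, purely illustrative and not the actual Catalyst 
code: when grouping distinct aggregates by the columns they are distinct on, 
compare the argument expressions with top-level casts stripped, so that 
sum(distinct value1) and count(distinct value1) land in the same group even 
though sum wraps value1 in a Cast.
{code}
import org.apache.spark.sql.catalyst.expressions.{Cast, Expression}

// Strip any chain of top-level casts so that cast(value1 as bigint) and
// value1 compare as the same distinct column.
def stripCasts(e: Expression): Expression = e match {
  case Cast(child, _) => stripCasts(child)
  case other          => other
}

// Group distinct aggregate calls by their argument lists, ignoring casts.
def groupDistinctAggs(distinctAggs: Seq[(Expression, Seq[Expression])]) =
  distinctAggs.groupBy { case (_, args) => args.map(stripCasts) }
{code}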



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11027) Better group distinct columns in query compilation

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11027:


Assignee: (was: Apache Spark)

> Better group distinct columns in query compilation
> --
>
> Key: SPARK-11027
> URL: https://issues.apache.org/jira/browse/SPARK-11027
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> In AggregationQuerySuite, we have a test
> {code}
> checkAnswer(
>   sqlContext.sql(
> """
>   |SELECT sum(distinct value1), kEY - 100, count(distinct value1)
>   |FROM agg2
>   |GROUP BY Key - 100
> """.stripMargin),
>   Row(40, -99, 2) :: Row(0, -98, 2) :: Row(null, -97, 0) :: Row(30, null, 
> 3) :: Nil)
> {code}
> We will treat it as having two distinct columns because sum causes a cast on 
> value1. Maybe we can ignore the cast when we group distinct columns. So, it 
> will not be treated as having two distinct columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11027) Better group distinct columns in query compilation

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956672#comment-14956672
 ] 

Apache Spark commented on SPARK-11027:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/9115

> Better group distinct columns in query compilation
> --
>
> Key: SPARK-11027
> URL: https://issues.apache.org/jira/browse/SPARK-11027
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> In AggregationQuerySuite, we have a test
> {code}
> checkAnswer(
>   sqlContext.sql(
> """
>   |SELECT sum(distinct value1), kEY - 100, count(distinct value1)
>   |FROM agg2
>   |GROUP BY Key - 100
> """.stripMargin),
>   Row(40, -99, 2) :: Row(0, -98, 2) :: Row(null, -97, 0) :: Row(30, null, 
> 3) :: Nil)
> {code}
> We will treat it as having two distinct columns because sum causes a cast on 
> value1. Maybe we can ignore the cast when we group distinct columns. So, it 
> will not be treated as having two distinct columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11058) failed spark job reports on YARN as successful

2015-10-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956640#comment-14956640
 ] 

Sean Owen commented on SPARK-11058:
---

I suspect that either your job didn't actually fail at the driver level, or 
this is in fact the same problem with handling the ".inprogress" file reported 
in other JIRAs and fixed since 1.3.x. Are you able to try 1.5.x?

> failed spark job reports on YARN as successful
> --
>
> Key: SPARK-11058
> URL: https://issues.apache.org/jira/browse/SPARK-11058
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
> Environment: CDH 5.4
>Reporter: Lan Jiang
>Priority: Minor
>
> I have a Spark batch job running on CDH 5.4 + Spark 1.3.0. The job is submitted 
> in “yarn-client” mode. The job itself failed because YARN killed several 
> executor containers that exceeded the memory limit imposed by YARN. 
> However, when I went to the YARN resource manager site, it displayed the job 
> as successful. I found there was an issue reported in JIRA 
> https://issues.apache.org/jira/browse/SPARK-3627, but it says it was fixed in 
> Spark 1.2. On Spark history server, it shows the job as “Incomplete”. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11049) If a single executor fails to allocate memory, entire job fails

2015-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11049.
---
Resolution: Not A Problem

Pending more info

> If a single executor fails to allocate memory, entire job fails
> ---
>
> Key: SPARK-11049
> URL: https://issues.apache.org/jira/browse/SPARK-11049
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Brian
>
> To reproduce:
> * Create a spark cluster using start-master.sh and start-slave.sh (I believe 
> this is the "standalone cluster manager?").  
> * Leave a process running on some nodes that takes up a significant amount of 
> RAM.
> * Leave some nodes with plenty of RAM to run spark.
> * Run a job against this cluster with spark.executor.memory asking for all or 
> most of the memory available on each node.
> On the node that has insufficient memory, there will of course be an error 
> like:
> Error occurred during initialization of VM
> Could not reserve enough space for object heap
> Could not create the Java virtual machine.
> On the driver node, and in the spark master UI, I see that _all_ executors 
> exit or are killed, and the entire job fails.  It would be better if there 
> was an indication of which individual node is actually at fault.  It would 
> also be better if the cluster manager could handle failing-over to nodes that 
> are still operating properly and have sufficient RAM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11101) pipe() operation OOM

2015-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11101.
---
Resolution: Invalid

If it's a question, you should ask at u...@spark.apache.org, not make a JIRA. 
It may have nothing to do with your process, though you do need to verify how 
much memory it actually uses. There is little margin in the YARN allocation for 
off-heap memory, so you probably have to increase this value, yes.
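
A speculative sketch, not a confirmed fix: the reported job uses only 6 
partitions for roughly 10 TB of data, so each pipe() task pushes an enormous 
amount of data through a single external process. Very large partitions are a 
common reason for blowing past the YARN overhead allowance, so repartitioning 
before pipe() is a cheap thing to try. The partition count and binary path 
below are illustrative placeholders.
{code}
// Speculative sketch: split the data into many more, smaller partitions
// before piping it through the external C++ program.
val piped = inputRdd
  .repartition(4000)                 // far more than the reported 6 partitions
  .pipe("/path/to/external/binary")  // illustrative path
piped.count()
{code}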

> pipe() operation OOM
> 
>
> Key: SPARK-11101
> URL: https://issues.apache.org/jira/browse/SPARK-11101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
> Environment: spark on yarn
>Reporter: hotdog
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> when using the pipe() operation with large data (10TB), the pipe() operation 
> always OOMs. 
> I use pipe() to call an external C++ process. I'm sure the C++ program only 
> uses a little memory (about 1MB).
> my parameters:
> executor-memory 16g
> executor-cores 4
> num-executors 400
> "spark.yarn.executor.memoryOverhead", "8192"
> partition number: 6
> does pipe() operation use many off-heap memory? 
> the log is :
> killed by YARN for exceeding memory limits. 24.4 GB of 24 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> should I continue boosting spark.yarn.executor.memoryOverhead? Or there are 
> some bugs in the pipe() operation?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7099) Floating point literals cannot be specified using exponent

2015-10-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7099.
--
Resolution: Not A Problem

> Floating point literals cannot be specified using exponent
> --
>
> Key: SPARK-7099
> URL: https://issues.apache.org/jira/browse/SPARK-7099
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
> Environment: Windows, Linux, Mac OS X
>Reporter: Peter Hagelund
>Priority: Minor
>
> Floating point literals cannot be expressed in scientific notation using an 
> exponent, like e.g. 1.23E4.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning

2015-10-14 Thread qian, chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956508#comment-14956508
 ] 

qian, chen edited comment on SPARK-6910 at 10/14/15 9:03 AM:
-

I'm using spark-sql (Spark 1.5.1 && Hadoop 2.4.0) and found a very interesting 
thing in the spark-sql shell.
At first I ran this, and it took about 3 minutes:
select * from table1 where date='20151010' and hour='12' and name='x' limit 5;
Time taken: 164.502 seconds

Then I ran this, and it only took about 10s. date, hour and name are partition 
columns in this Hive table, which has >4000 partitions:
select * from table1 where date='20151010' and hour='13' limit 5;
Time taken: 10.881 seconds
Is it because the first query has to download all partition information from 
the Hive metastore, while the second is faster because all partitions are now 
cached in memory?
Any suggestions for speeding up the first query?


was (Author: nedqian):
I'm using spark-sql (spark version 1.5.1 && hadoop 2.4.0) and found a very 
interesting thing:
in spark-sql shell:
at first I ran this, it took about 3 minutes
select * from table1 where date='20151010' and hour='12' and name='x' limit 5;
Time taken: 164.502 seconds

then I ran this, it only took 10s. date, hour and name are partition columns in 
this hive table. this table has >4000 partitions
select * from table1 where date='20151010' and hour='13' limit 5;
Time taken: 10.881 seconds
is it because that the first time I need to download all partition information 
from hive metastore? the second query is faster because all partitions are 
cached in memory now?

> Support for pushing predicates down to metastore for partition pruning
> --
>
> Key: SPARK-6910
> URL: https://issues.apache.org/jira/browse/SPARK-6910
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheolsoo Park
>Priority: Critical
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning

2015-10-14 Thread qian, chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956508#comment-14956508
 ] 

qian, chen commented on SPARK-6910:
---

I'm using spark-sql (Spark 1.5.1 && Hadoop 2.4.0) and found a very interesting 
thing in the spark-sql shell.
At first I ran this, and it took about 3 minutes:
select * from table1 where date='20151010' and hour='12' and name='x' limit 5;
Time taken: 164.502 seconds

Then I ran this, and it only took about 10s. date, hour and name are partition 
columns in this Hive table, which has >4000 partitions:
select * from table1 where date='20151010' and hour='13' limit 5;
Time taken: 10.881 seconds
Is it because the first query has to download all partition information from 
the Hive metastore, while the second is faster because all partitions are now 
cached in memory?
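
That matches the cost this JIRA targets: with metastore-side pruning disabled 
(the default in 1.5, as far as I know), the first query fetches metadata for 
all >4000 partitions from the metastore, and later queries reuse the cached 
client state. A hedged suggestion, assuming the filters really are on partition 
columns only, is to turn on the pruning added by this issue and re-run the 
first query:
{code}
// Hedged sketch: push the partition predicates down to the Hive metastore
// so only the matching partitions' metadata is fetched.
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.sql(
  "select * from table1 where date='20151010' and hour='12' and name='x' limit 5"
).show()
{code}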

> Support for pushing predicates down to metastore for partition pruning
> --
>
> Key: SPARK-6910
> URL: https://issues.apache.org/jira/browse/SPARK-6910
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheolsoo Park
>Priority: Critical
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-11099:
---
Affects Version/s: 1.5.1

> Default conf property file is not loaded 
> -
>
> Key: SPARK-11099
> URL: https://issues.apache.org/jira/browse/SPARK-11099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Critical
>
> spark.driver.extraClassPath doesn't take effect in the latest code, and I found 
> the root cause is that the default conf property file is not loaded.
> The bug is caused by this code snippet in AbstractCommandBuilder:
> {code}
>   Map<String, String> getEffectiveConfig() throws IOException {
> if (effectiveConfig == null) {
>   if (propertiesFile == null) {
> effectiveConfig = conf;   // return from here if no propertyFile 
> is provided
>   } else {
> effectiveConfig = new HashMap<>(conf);
> Properties p = loadPropertiesFile();// default propertyFile 
> will load here
> for (String key : p.stringPropertyNames()) {
>   if (!effectiveConfig.containsKey(key)) {
> effectiveConfig.put(key, p.getProperty(key));
>   }
> }
>   }
> }
> return effectiveConfig;
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11102) Not readable exception when specifing non-exist input for JSON data source

2015-10-14 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956491#comment-14956491
 ] 

Jeff Zhang commented on SPARK-11102:


Will create a pull request soon
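
For context, a hedged sketch of the kind of up-front check such a patch might 
add; the helper and its location are illustrative, not the actual change. The 
point is to fail fast with a readable message instead of surfacing Hadoop's 
"No input paths specified in job" from deep inside partition computation.
{code}
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SQLContext

// Illustrative helper: verify every input path exists before building the plan.
def checkInputPaths(sqlContext: SQLContext, paths: Seq[String]): Unit = {
  val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
  paths.foreach { p =>
    val path = new Path(p)
    val fs = path.getFileSystem(hadoopConf)
    require(fs.exists(path), s"Input path does not exist: $p")
  }
}
{code}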

> Not readable exception when specifing non-exist input for JSON data source
> --
>
> Key: SPARK-11102
> URL: https://issues.apache.org/jira/browse/SPARK-11102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> If I specify a non-existent input path for the JSON data source, the following 
> exception is thrown, and it is not readable. 
> {code}
> 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 19.9 KB, free 251.4 KB)
> 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB)
> 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at 
> :19
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:28)
>   at $iwC$$iwC$$iwC$$iwC.(:30)
>   at $iwC$$iwC$$iwC.(:32)
>   at $iwC$$iwC.(:34)
>   at $iwC.(:36)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11102) Not readable exception when specifing non-exist input for JSON data source

2015-10-14 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-11102:
--

 Summary: Not readable exception when specifing non-exist input for 
JSON data source
 Key: SPARK-11102
 URL: https://issues.apache.org/jira/browse/SPARK-11102
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Jeff Zhang


If I specify a non-existent input path for the JSON data source, the following 
exception is thrown, and it is not readable. 

{code}
15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 19.9 KB, free 251.4 KB)
15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB)
15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at 
:19
java.io.IOException: No input paths specified in job
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085)
at 
org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
at 
org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
at 
org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
at 
org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:28)
at $iwC$$iwC$$iwC$$iwC.(:30)
at $iwC$$iwC$$iwC.(:32)
at $iwC$$iwC.(:34)
at $iwC.(:36)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11102) Not readable exception when specifing non-exist input for JSON data source

2015-10-14 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-11102:
---
Priority: Minor  (was: Major)

> Not readable exception when specifing non-exist input for JSON data source
> --
>
> Key: SPARK-11102
> URL: https://issues.apache.org/jira/browse/SPARK-11102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> If I specify a non-existent input path for the JSON data source, the following 
> exception is thrown, and it is not readable. 
> {code}
> 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 19.9 KB, free 251.4 KB)
> 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB)
> 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at 
> :19
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:28)
>   at $iwC$$iwC$$iwC$$iwC.(:30)
>   at $iwC$$iwC$$iwC.(:32)
>   at $iwC$$iwC.(:34)
>   at $iwC.(:36)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11102) Not readable exception when specifing non-exist input for JSON data source

2015-10-14 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-11102:
---
Issue Type: Improvement  (was: Bug)

> Not readable exception when specifing non-exist input for JSON data source
> --
>
> Key: SPARK-11102
> URL: https://issues.apache.org/jira/browse/SPARK-11102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>
> If I specify a non-existent input path for the JSON data source, the following 
> exception is thrown, and it is not readable. 
> {code}
> 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 19.9 KB, free 251.4 KB)
> 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB)
> 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at 
> :19
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:28)
>   at $iwC$$iwC$$iwC$$iwC.(:30)
>   at $iwC$$iwC$$iwC.(:32)
>   at $iwC$$iwC.(:34)
>   at $iwC.(:36)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11099:


Assignee: (was: Apache Spark)

> Default conf property file is not loaded 
> -
>
> Key: SPARK-11099
> URL: https://issues.apache.org/jira/browse/SPARK-11099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Reporter: Jeff Zhang
>Priority: Critical
>
> spark.driver.extraClassPath doesn't take effect in the latest code, and I found 
> the root cause is that the default conf property file is not loaded.
> The bug is caused by this code snippet in AbstractCommandBuilder:
> {code}
>   Map<String, String> getEffectiveConfig() throws IOException {
> if (effectiveConfig == null) {
>   if (propertiesFile == null) {
> effectiveConfig = conf;   // return from here if no propertyFile 
> is provided
>   } else {
> effectiveConfig = new HashMap<>(conf);
> Properties p = loadPropertiesFile();// default propertyFile 
> will load here
> for (String key : p.stringPropertyNames()) {
>   if (!effectiveConfig.containsKey(key)) {
> effectiveConfig.put(key, p.getProperty(key));
>   }
> }
>   }
> }
> return effectiveConfig;
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11099:


Assignee: Apache Spark

> Default conf property file is not loaded 
> -
>
> Key: SPARK-11099
> URL: https://issues.apache.org/jira/browse/SPARK-11099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Critical
>
> spark.driver.extraClassPath doesn't take effect in the latest code, and I found 
> the root cause is that the default conf property file is not loaded.
> The bug is caused by this code snippet in AbstractCommandBuilder:
> {code}
>   Map<String, String> getEffectiveConfig() throws IOException {
> if (effectiveConfig == null) {
>   if (propertiesFile == null) {
> effectiveConfig = conf;   // return from here if no propertyFile 
> is provided
>   } else {
> effectiveConfig = new HashMap<>(conf);
> Properties p = loadPropertiesFile();// default propertyFile 
> will load here
> for (String key : p.stringPropertyNames()) {
>   if (!effectiveConfig.containsKey(key)) {
> effectiveConfig.put(key, p.getProperty(key));
>   }
> }
>   }
> }
> return effectiveConfig;
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




[jira] [Commented] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956470#comment-14956470
 ] 

Apache Spark commented on SPARK-11099:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9114

> Default conf property file is not loaded 
> -
>
> Key: SPARK-11099
> URL: https://issues.apache.org/jira/browse/SPARK-11099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Reporter: Jeff Zhang
>Priority: Critical
>
> spark.driver.extraClassPath doesn't take effect in the latest code, and I found 
> the root cause is that the default conf property file is not loaded.
> The bug is caused by this code snippet in AbstractCommandBuilder:
> {code}
>   Map<String, String> getEffectiveConfig() throws IOException {
> if (effectiveConfig == null) {
>   if (propertiesFile == null) {
> effectiveConfig = conf;   // return from here if no propertyFile 
> is provided
>   } else {
> effectiveConfig = new HashMap<>(conf);
> Properties p = loadPropertiesFile();// default propertyFile 
> will load here
> for (String key : p.stringPropertyNames()) {
>   if (!effectiveConfig.containsKey(key)) {
> effectiveConfig.put(key, p.getProperty(key));
>   }
> }
>   }
> }
> return effectiveConfig;
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11100) HiveThriftServer not registering with Zookeeper

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956467#comment-14956467
 ] 

Apache Spark commented on SPARK-11100:
--

User 'xiaowangyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9113

> HiveThriftServer not registering with Zookeeper
> ---
>
> Key: SPARK-11100
> URL: https://issues.apache.org/jira/browse/SPARK-11100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: Hive-1.2.1
> Hadoop-2.6.0
>Reporter: Xiaoyu Wang
>
> hive-site.xml config:
> {code}
> <property>
>   <name>hive.server2.support.dynamic.service.discovery</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.server2.zookeeper.namespace</name>
>   <value>sparkhiveserver2</value>
> </property>
> <property>
>   <name>hive.zookeeper.quorum</name>
>   <value>zk1,zk2,zk3</value>
> </property>
> {code}
> then start thrift server
> {code}
> start-thriftserver.sh --master yarn
> {code}
> In zookeeper znode "sparkhiveserver2" not found.
> hiveserver2 is working on this config!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11100) HiveThriftServer not registering with Zookeeper

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11100:


Assignee: (was: Apache Spark)

> HiveThriftServer not registering with Zookeeper
> ---
>
> Key: SPARK-11100
> URL: https://issues.apache.org/jira/browse/SPARK-11100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: Hive-1.2.1
> Hadoop-2.6.0
>Reporter: Xiaoyu Wang
>
> hive-site.xml config:
> {code}
> <property>
>   <name>hive.server2.support.dynamic.service.discovery</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.server2.zookeeper.namespace</name>
>   <value>sparkhiveserver2</value>
> </property>
> <property>
>   <name>hive.zookeeper.quorum</name>
>   <value>zk1,zk2,zk3</value>
> </property>
> {code}
> then start thrift server
> {code}
> start-thriftserver.sh --master yarn
> {code}
> In zookeeper znode "sparkhiveserver2" not found.
> hiveserver2 is working on this config!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11100) HiveThriftServer not registering with Zookeeper

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11100:


Assignee: Apache Spark

> HiveThriftServer not registering with Zookeeper
> ---
>
> Key: SPARK-11100
> URL: https://issues.apache.org/jira/browse/SPARK-11100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: Hive-1.2.1
> Hadoop-2.6.0
>Reporter: Xiaoyu Wang
>Assignee: Apache Spark
>
> hive-site.xml config:
> {code}
> <property>
>   <name>hive.server2.support.dynamic.service.discovery</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.server2.zookeeper.namespace</name>
>   <value>sparkhiveserver2</value>
> </property>
> <property>
>   <name>hive.zookeeper.quorum</name>
>   <value>zk1,zk2,zk3</value>
> </property>
> {code}
> then start thrift server
> {code}
> start-thriftserver.sh --master yarn
> {code}
> In zookeeper znode "sparkhiveserver2" not found.
> hiveserver2 is working on this config!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11101) pipe() operation OOM

2015-10-14 Thread hotdog (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hotdog updated SPARK-11101:
---
Description: 
when using the pipe() operation with large data (10TB), the pipe() operation 
always OOMs. 
I use pipe() to call an external C++ process. I'm sure the C++ program only 
uses a little memory (about 1MB).
my parameters:
executor-memory 16g
executor-cores 4
num-executors 400
"spark.yarn.executor.memoryOverhead", "8192"
partition number: 6

does pipe() operation use many off-heap memory? 
the log is :
killed by YARN for exceeding memory limits. 24.4 GB of 24 GB physical memory 
used. Consider boosting spark.yarn.executor.memoryOverhead.

should I continue boosting spark.yarn.executor.memoryOverhead? Or there are 
some bugs in the pipe() operation?


  was:
when using pipe() operation with large data(10TB), the pipe() operation always 
OOM. 
my parameters:
executor-memory 16g
executor-cores 4
num-executors 400
"spark.yarn.executor.memoryOverhead", "8192"
partition number: 6

does pipe() operation use many off-heap memory? 
the log is :
killed by YARN for exceeding memory limits. 24.4 GB of 24 GB physical memory 
used. Consider boosting spark.yarn.executor.memoryOverhead.

should I continue boosting spark.yarn.executor.memoryOverhead? Or there are 
some bugs in the pipe() operation?



> pipe() operation OOM
> 
>
> Key: SPARK-11101
> URL: https://issues.apache.org/jira/browse/SPARK-11101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
> Environment: spark on yarn
>Reporter: hotdog
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> when using the pipe() operation with large data (10TB), the pipe() operation 
> always OOMs. 
> I use pipe() to call an external C++ process. I'm sure the C++ program only 
> uses a little memory (about 1MB).
> my parameters:
> executor-memory 16g
> executor-cores 4
> num-executors 400
> "spark.yarn.executor.memoryOverhead", "8192"
> partition number: 6
> does pipe() operation use many off-heap memory? 
> the log is :
> killed by YARN for exceeding memory limits. 24.4 GB of 24 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> should I continue boosting spark.yarn.executor.memoryOverhead? Or there are 
> some bugs in the pipe() operation?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11101) pipe() operation OOM

2015-10-14 Thread hotdog (JIRA)
hotdog created SPARK-11101:
--

 Summary: pipe() operation OOM
 Key: SPARK-11101
 URL: https://issues.apache.org/jira/browse/SPARK-11101
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
 Environment: spark on yarn
Reporter: hotdog


when using pipe() operation with large data(10TB), the pipe() operation always 
OOM. 
my parameters:
executor-memory 16g
executor-cores 4
num-executors 400
"spark.yarn.executor.memoryOverhead", "8192"
partition number: 6

does pipe() operation use many off-heap memory? 
the log is :
killed by YARN for exceeding memory limits. 24.4 GB of 24 GB physical memory 
used. Consider boosting spark.yarn.executor.memoryOverhead.

should I continue boosting spark.yarn.executor.memoryOverhead? Or there are 
some bugs in the pipe() operation?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11083) insert overwrite table failed when beeline reconnect

2015-10-14 Thread Weizhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956453#comment-14956453
 ] 

Weizhong commented on SPARK-11083:
--

I have retested on the latest master branch (ending at commit 
ce3f9a80657751ee0bc0ed6a9b6558acbb40af4f, [SPARK-11091] [SQL] Change 
spark.sql.canonicalizeView to spark.sql.nativeView.) and this issue has been 
fixed. But I am not yet sure which PR fixed it.

> insert overwrite table failed when beeline reconnect
> 
>
> Key: SPARK-11083
> URL: https://issues.apache.org/jira/browse/SPARK-11083
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Spark: master branch
> Hadoop: 2.7.1
> JDK: 1.8.0_60
>Reporter: Weizhong
>
> 1. Start Thriftserver
> 2. Use beeline to connect to the thriftserver, then execute an "insert overwrite 
> table_name ..." clause -- success
> 3. Exit beeline
> 4. Reconnect to the thriftserver, and then execute an "insert overwrite 
> table_name ..." clause. -- failed
> {noformat}
> 15/10/13 18:44:35 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.sql.hive.client.Shim_v1_2.loadDynamicPartitions(HiveShim.scala:520)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply$mcV$sp(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadDynamicPartitions$1.apply(ClientWrapper.scala:506)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.loadDynamicPartitions(ClientWrapper.scala:505)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:225)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:58)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:144)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:129)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:739)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:224)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://9.91.8.214:9000/user/hive/warehouse/tpcds_bin_partitioned_orc_2.db/catalog_returns/.hive-sta

[jira] [Updated] (SPARK-11100) HiveThriftServer not registering with Zookeeper

2015-10-14 Thread Xiaoyu Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoyu Wang updated SPARK-11100:

Description: 
hive-site.xml config:
{code}
<property>
  <name>hive.server2.support.dynamic.service.discovery</name>
  <value>true</value>
</property>
<property>
  <name>hive.server2.zookeeper.namespace</name>
  <value>sparkhiveserver2</value>
</property>
<property>
  <name>hive.zookeeper.quorum</name>
  <value>zk1,zk2,zk3</value>
</property>
{code}
then start thrift server
{code}
start-thriftserver.sh --master yarn
{code}

In zookeeper znode "sparkhiveserver2" not found.
hiveserver2 is working on this config!

  was:
hive-site.xml config:
<property>
  <name>hive.server2.support.dynamic.service.discovery</name>
  <value>true</value>
</property>
<property>
  <name>hive.server2.zookeeper.namespace</name>
  <value>sparkhiveserver2</value>
</property>
<property>
  <name>hive.zookeeper.quorum</name>
  <value>zk1,zk2,zk3</value>
</property>

then start thrift server
start-thriftserver.sh --master yarn

In zookeeper znode "sparkhiveserver2" not found.
hiveserver2 is working on this config!


> HiveThriftServer not registering with Zookeeper
> ---
>
> Key: SPARK-11100
> URL: https://issues.apache.org/jira/browse/SPARK-11100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: Hive-1.2.1
> Hadoop-2.6.0
>Reporter: Xiaoyu Wang
>
> hive-site.xml config:
> {code}
> <property>
>   <name>hive.server2.support.dynamic.service.discovery</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.server2.zookeeper.namespace</name>
>   <value>sparkhiveserver2</value>
> </property>
> <property>
>   <name>hive.zookeeper.quorum</name>
>   <value>zk1,zk2,zk3</value>
> </property>
> {code}
> then start thrift server
> {code}
> start-thriftserver.sh --master yarn
> {code}
> In zookeeper znode "sparkhiveserver2" not found.
> hiveserver2 is working on this config!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11100) HiveThriftServer not registering with Zookeeper

2015-10-14 Thread Xiaoyu Wang (JIRA)
Xiaoyu Wang created SPARK-11100:
---

 Summary: HiveThriftServer not registering with Zookeeper
 Key: SPARK-11100
 URL: https://issues.apache.org/jira/browse/SPARK-11100
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
 Environment: Hive-1.2.1
Hadoop-2.6.0
Reporter: Xiaoyu Wang


hive-site.xml config:
<property>
  <name>hive.server2.support.dynamic.service.discovery</name>
  <value>true</value>
</property>
<property>
  <name>hive.server2.zookeeper.namespace</name>
  <value>sparkhiveserver2</value>
</property>
<property>
  <name>hive.zookeeper.quorum</name>
  <value>zk1,zk2,zk3</value>
</property>

then start thrift server
start-thriftserver.sh --master yarn

In zookeeper znode "sparkhiveserver2" not found.
hiveserver2 is working on this config!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10925) Exception when joining DataFrames

2015-10-14 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956367#comment-14956367
 ] 

Xiao Li edited comment on SPARK-10925 at 10/14/15 7:16 AM:
---

Also hit the same problem, but it is not related to UDFs. Trying to narrow 
down the root cause inside the analyzer internals. 


was (Author: smilegator):
Also hit the same problem. Trying to narrow down the root cause of the analyzer 
internal. 
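
Until the analyzer issue is pinned down, a hedged workaround sketch that has 
helped with similar self-join resolution errors; the column names come from the 
report, but the exact aggregation and helper names are illustrative. The idea 
is to give the aggregated side fresh column names before re-joining, so the 
plan never contains two attributes named "surname" from the same lineage.
{code}
// Illustrative workaround, not a confirmed fix for this ticket.
val bySurname = df.groupBy("surname").count()
  .withColumnRenamed("surname", "surname_key")

val rejoined = df.join(bySurname, df("surname") === bySurname("surname_key"))
{code}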

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520)
>   at TestCase2$.main(TestCase2.scala:51)
>   at TestCase2.main(TestCase2.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccesso
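For readers trying to follow the described workflow, a rough reproduction looks like
the sketch below. This is a minimal Scala sketch, not the attached TestCase2.scala:
the sample data, column names and cleaning UDFs are made up, and it only mirrors the
aggregate-then-rejoin pattern; it is not guaranteed to reach exactly the same
analyzer state as the reporter's code.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{count, udf}

object Spark10925Sketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SPARK-10925").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Made-up sample data standing in for the real input.
    val df = Seq(
      (1, "Alice ", "Smith ", "1980-01-01T00:00"),
      (2, "Bob ", "Jones ", "1975-05-05T00:00")
    ).toDF("id", "name", "surname", "birthDate")

    // Hypothetical cleaning UDFs applied to "name", "surname" and "birthDate".
    val clean = udf((s: String) => s.trim.toLowerCase)
    val cleanDate = udf((s: String) => s.take(10))
    val cleaned = df
      .withColumn("name", clean(df("name")))
      .withColumn("surname", clean(df("surname")))
      .withColumn("birthDate", cleanDate(df("birthDate")))

    // Aggregate on "name" and re-join with the DataFrame.
    val byName = cleaned.groupBy("name").agg(count("id").as("nameCount"))
    val step1 = cleaned.join(byName, "name")

    // Aggregate on "surname" and re-join; this is the step after which the
    // reporter sees the "resolved attribute(s) ... missing" analysis error.
    val bySurname = step1.groupBy("surname").agg(count("id").as("surnameCount"))
    step1.join(bySurname, "surname").show()
  }
}
{code}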

[jira] [Updated] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-11099:
---
Description: 
spark.driver.extraClassPath doesn't take effect in the latest code; the root cause
is that the default conf property file is not loaded.

The bug is caused by this code snippet in AbstractCommandBuilder
{code}
  Map<String, String> getEffectiveConfig() throws IOException {
if (effectiveConfig == null) {
  if (propertiesFile == null) {
effectiveConfig = conf;   // return from here if no propertyFile is 
provided
  } else {
effectiveConfig = new HashMap<>(conf);
Properties p = loadPropertiesFile();// default propertyFile 
will load here
for (String key : p.stringPropertyNames()) {
  if (!effectiveConfig.containsKey(key)) {
effectiveConfig.put(key, p.getProperty(key));
  }
}
  }
}
return effectiveConfig;
  }
{code}

  was:
spark.driver.extraClassPath doesn't take effect in the latest code; the root cause
is that the default conf property file is not loaded.

The bug is caused by this code snippet in AbstractCommandBuilder
{code}
  Map<String, String> getEffectiveConfig() throws IOException {
if (effectiveConfig == null) {
  if (propertiesFile == null) {
effectiveConfig = conf;   
  } else {
effectiveConfig = new HashMap<>(conf);
Properties p = loadPropertiesFile();
for (String key : p.stringPropertyNames()) {
  if (!effectiveConfig.containsKey(key)) {
effectiveConfig.put(key, p.getProperty(key));
  }
}
  }
}
return effectiveConfig;
  }
{code}


> Default conf property file is not loaded 
> -
>
> Key: SPARK-11099
> URL: https://issues.apache.org/jira/browse/SPARK-11099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Reporter: Jeff Zhang
>Priority: Critical
>
> spark.driver.extraClassPath doesn't take effect in the latest code; the root cause
> is that the default conf property file is not loaded.
> The bug is caused by this code snippet in AbstractCommandBuilder
> {code}
>   Map<String, String> getEffectiveConfig() throws IOException {
> if (effectiveConfig == null) {
>   if (propertiesFile == null) {
> effectiveConfig = conf;   // return from here if no propertyFile 
> is provided
>   } else {
> effectiveConfig = new HashMap<>(conf);
> Properties p = loadPropertiesFile();// default propertyFile 
> will load here
> for (String key : p.stringPropertyNames()) {
>   if (!effectiveConfig.containsKey(key)) {
> effectiveConfig.put(key, p.getProperty(key));
>   }
> }
>   }
> }
> return effectiveConfig;
>   }
> {code}
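To make the intended behaviour concrete: explicitly supplied settings should win, but
the defaults file (whatever loadPropertiesFile() resolves, typically
conf/spark-defaults.conf) should still be consulted even when no --properties-file is
passed. The launcher code above is Java; the following is only a minimal Scala sketch
of that merge semantics with made-up names, not the actual fix:
{code}
object EffectiveConfigSketch {
  // Intended semantics: keys set directly on the builder override keys read from
  // the defaults file, but the defaults file is always consulted.
  def effectiveConfig(conf: Map[String, String],
                      defaults: Map[String, String]): Map[String, String] =
    defaults ++ conf // entries in `conf` win over entries from the defaults file

  def main(args: Array[String]): Unit = {
    // With the current code, a property present only in spark-defaults.conf
    // (e.g. spark.driver.extraClassPath) is dropped when no --properties-file is given.
    val defaults = Map("spark.driver.extraClassPath" -> "/opt/extra/*")
    val explicit = Map("spark.app.name" -> "demo")
    assert(effectiveConfig(explicit, defaults)("spark.driver.extraClassPath") == "/opt/extra/*")
    println(effectiveConfig(explicit, defaults))
  }
}
{code}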



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956399#comment-14956399
 ] 

Jeff Zhang commented on SPARK-11099:


Will create a pull request soon

> Default conf property file is not loaded 
> -
>
> Key: SPARK-11099
> URL: https://issues.apache.org/jira/browse/SPARK-11099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Reporter: Jeff Zhang
>Priority: Critical
>
> spark.driver.extraClassPath doesn't take effect in the latest code; the root cause
> is that the default conf property file is not loaded.
> The bug is caused by this code snippet in AbstractCommandBuilder
> {code}
>   Map<String, String> getEffectiveConfig() throws IOException {
> if (effectiveConfig == null) {
>   if (propertiesFile == null) {
> effectiveConfig = conf;   
>   } else {
> effectiveConfig = new HashMap<>(conf);
> Properties p = loadPropertiesFile();
> for (String key : p.stringPropertyNames()) {
>   if (!effectiveConfig.containsKey(key)) {
> effectiveConfig.put(key, p.getProperty(key));
>   }
> }
>   }
> }
> return effectiveConfig;
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-11099:
---
Component/s: Spark Shell

> Default conf property file is not loaded 
> -
>
> Key: SPARK-11099
> URL: https://issues.apache.org/jira/browse/SPARK-11099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Reporter: Jeff Zhang
>Priority: Critical
>
> spark.driver.extraClassPath doesn't take effect in the latest code; the root cause
> is that the default conf property file is not loaded.
> The bug is caused by this code snippet in AbstractCommandBuilder
> {code}
>   Map<String, String> getEffectiveConfig() throws IOException {
> if (effectiveConfig == null) {
>   if (propertiesFile == null) {
> effectiveConfig = conf;   
>   } else {
> effectiveConfig = new HashMap<>(conf);
> Properties p = loadPropertiesFile();
> for (String key : p.stringPropertyNames()) {
>   if (!effectiveConfig.containsKey(key)) {
> effectiveConfig.put(key, p.getProperty(key));
>   }
> }
>   }
> }
> return effectiveConfig;
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-11099:
---
Component/s: Spark Submit

> Default conf property file is not loaded 
> -
>
> Key: SPARK-11099
> URL: https://issues.apache.org/jira/browse/SPARK-11099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Reporter: Jeff Zhang
>Priority: Critical
>
> spark.driver.extraClassPath doesn't take effect in the latest code; the root cause
> is that the default conf property file is not loaded.
> The bug is caused by this code snippet in AbstractCommandBuilder
> {code}
>   Map<String, String> getEffectiveConfig() throws IOException {
> if (effectiveConfig == null) {
>   if (propertiesFile == null) {
> effectiveConfig = conf;   
>   } else {
> effectiveConfig = new HashMap<>(conf);
> Properties p = loadPropertiesFile();
> for (String key : p.stringPropertyNames()) {
>   if (!effectiveConfig.containsKey(key)) {
> effectiveConfig.put(key, p.getProperty(key));
>   }
> }
>   }
> }
> return effectiveConfig;
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11099) Default conf property file is not loaded

2015-10-14 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-11099:
--

 Summary: Default conf property file is not loaded 
 Key: SPARK-11099
 URL: https://issues.apache.org/jira/browse/SPARK-11099
 Project: Spark
  Issue Type: Bug
Reporter: Jeff Zhang
Priority: Critical


spark.driver.extraClassPath doesn't take effect in the latest code; the root cause
is that the default conf property file is not loaded.

The bug is caused by this code snippet in AbstractCommandBuilder
{code}
  Map<String, String> getEffectiveConfig() throws IOException {
if (effectiveConfig == null) {
  if (propertiesFile == null) {
effectiveConfig = conf;   
  } else {
effectiveConfig = new HashMap<>(conf);
Properties p = loadPropertiesFile();
for (String key : p.stringPropertyNames()) {
  if (!effectiveConfig.containsKey(key)) {
effectiveConfig.put(key, p.getProperty(key));
  }
}
  }
}
return effectiveConfig;
  }
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11098) RPC message ordering is not guaranteed

2015-10-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956396#comment-14956396
 ] 

Reynold Xin commented on SPARK-11098:
-

[~vanzin]  zsxwing told me you were working on this. Let me know if it is not 
the case.


> RPC message ordering is not guaranteed
> --
>
> Key: SPARK-11098
> URL: https://issues.apache.org/jira/browse/SPARK-11098
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>
> NettyRpcEnv doesn't guarantee message delivery order since there are multiple
> threads sending messages in the clientConnectionExecutor thread pool. We should
> fix that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11098) RPC message ordering is not guaranteed

2015-10-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11098:
---

 Summary: RPC message ordering is not guaranteed
 Key: SPARK-11098
 URL: https://issues.apache.org/jira/browse/SPARK-11098
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin


NettyRpcEnv doesn't guarantee message delivery order since there are multiple
threads sending messages in the clientConnectionExecutor thread pool. We should
fix that.
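One way to picture a fix (a sketch only, with assumed names rather than NettyRpcEnv's
real classes): route every outgoing message for a given remote address through a
single-threaded executor, so writes to one peer happen in submission order even
though callers run on many threads.
{code}
import java.util.concurrent.{ConcurrentHashMap, Executors, ExecutorService}

// Hypothetical helper: `write` is whatever actually pushes bytes to the peer.
class OrderedSender(write: (String, AnyRef) => Unit) {
  private val perPeer = new ConcurrentHashMap[String, ExecutorService]()

  private def executorFor(remoteAddress: String): ExecutorService = {
    val existing = perPeer.get(remoteAddress)
    if (existing != null) return existing
    val fresh = Executors.newSingleThreadExecutor()
    val raced = perPeer.putIfAbsent(remoteAddress, fresh)
    if (raced != null) { fresh.shutdown(); raced } else fresh
  }

  // Messages to the same peer run FIFO on its single-threaded executor,
  // regardless of which caller thread enqueued them.
  def send(remoteAddress: String, message: AnyRef): Unit =
    executorFor(remoteAddress).execute(new Runnable {
      override def run(): Unit = write(remoteAddress, message)
    })

  def shutdown(): Unit = {
    val it = perPeer.values().iterator()
    while (it.hasNext) it.next().shutdown()
  }
}
{code}
A real implementation would more likely keep a per-peer outbox drained by the
existing pool instead of one thread per peer, but the ordering argument is the same:
only one task at a time touches a given connection.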




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11097) Add connection established callback to lower level RPC layer so we don't need to check for new connections in NettyRpcHandler.receive

2015-10-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956394#comment-14956394
 ] 

Reynold Xin commented on SPARK-11097:
-

cc [~vanzin] do you have time to do this?


> Add connection established callback to lower level RPC layer so we don't need 
> to check for new connections in NettyRpcHandler.receive
> -
>
> Key: SPARK-11097
> URL: https://issues.apache.org/jira/browse/SPARK-11097
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>
> I think we can remove the check for new connections in 
> NettyRpcHandler.receive if we just add a channel registered callback to the 
> lower level network module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11097) Add connection established callback to lower level RPC layer so we don't need to check for new connections in NettyRpcHandler.receive

2015-10-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11097:
---

 Summary: Add connection established callback to lower level RPC 
layer so we don't need to check for new connections in NettyRpcHandler.receive
 Key: SPARK-11097
 URL: https://issues.apache.org/jira/browse/SPARK-11097
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin


I think we can remove the check for new connections in NettyRpcHandler.receive 
if we just add a channel registered callback to the lower level network module.
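The shape of the change could look roughly like the sketch below; the names are
assumptions, not the real org.apache.spark.network API. The point is that the
transport invokes an explicit callback when a channel is registered, so receive() no
longer has to infer "first message from a new client" on its own.
{code}
import java.net.SocketAddress

// Hypothetical handler interface with an explicit connection-established hook.
trait ConnectionAwareRpcHandler {
  /** Called once by the transport layer when a new channel is registered. */
  def channelRegistered(remoteAddress: SocketAddress): Unit

  /** Called for every incoming message; no per-connection bookkeeping needed here. */
  def receive(remoteAddress: SocketAddress, message: Array[Byte]): Unit
}

// Example handler: connection tracking lives in the callback, not in receive().
class LoggingHandler extends ConnectionAwareRpcHandler {
  override def channelRegistered(remoteAddress: SocketAddress): Unit =
    println(s"channel registered: $remoteAddress")

  override def receive(remoteAddress: SocketAddress, message: Array[Byte]): Unit =
    println(s"received ${message.length} bytes from $remoteAddress")
}
{code}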




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11096) Post-hoc review Netty based RPC implementation - round 2

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11096:


Assignee: Reynold Xin  (was: Apache Spark)

> Post-hoc review Netty based RPC implementation - round 2
> 
>
> Key: SPARK-11096
> URL: https://issues.apache.org/jira/browse/SPARK-11096
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11096) Post-hoc review Netty based RPC implementation - round 2

2015-10-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11096:


Assignee: Apache Spark  (was: Reynold Xin)

> Post-hoc review Netty based RPC implementation - round 2
> 
>
> Key: SPARK-11096
> URL: https://issues.apache.org/jira/browse/SPARK-11096
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11096) Post-hoc review Netty based RPC implementation - round 2

2015-10-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956390#comment-14956390
 ] 

Apache Spark commented on SPARK-11096:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9112

> Post-hoc review Netty based RPC implementation - round 2
> 
>
> Key: SPARK-11096
> URL: https://issues.apache.org/jira/browse/SPARK-11096
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11096) Post-hoc review Netty based RPC implementation - round 2

2015-10-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11096:
---

 Summary: Post-hoc review Netty based RPC implementation - round 2
 Key: SPARK-11096
 URL: https://issues.apache.org/jira/browse/SPARK-11096
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


