[jira] [Commented] (SPARK-15579) SparkUI: Storage page is empty even if things are cached

2016-05-26 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302930#comment-15302930
 ] 

Andrew Or commented on SPARK-15579:
---

I tried this on 0f61d6efb45b9ee94fa663f67c4489fbdae2eded, which is literally 
the latest commit as of the time of writing this message.
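
For anyone reproducing this, a minimal spark-shell sketch (the element range and 
partition count below are arbitrary, not the exact values from the report) that 
also checks whether blocks are actually cached even though the Storage page 
shows nothing:

{code}
// Hedged repro sketch: cache an RDD, then ask the driver what it thinks is cached.
// sc.getRDDStorageInfo is a developer API listing RDDs with cached partitions.
val rdd = sc.parallelize(1 to 100000, 5000).cache()
rdd.count()
sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id}: ${info.numCachedPartitions} cached partitions, " +
    s"${info.memSize} bytes in memory")
}
// Expectation: the RDD is reported as cached here, yet the Storage page at
// http://localhost:4040/storage/ stays empty.
{code}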

> SparkUI: Storage page is empty even if things are cached
> 
>
> Key: SPARK-15579
> URL: https://issues.apache.org/jira/browse/SPARK-15579
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>
> scala> sc.parallelize(1 to 1, 5000).cache().count()
> SparkUI storage page is empty.






[jira] [Updated] (SPARK-15579) SparkUI: Storage page is empty even if things are cached

2016-05-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15579:
--
Description: 
scala> sc.parallelize(1 to 1, 5000).cache().count()

SparkUI storage page is empty.

> SparkUI: Storage page is empty even if things are cached
> 
>
> Key: SPARK-15579
> URL: https://issues.apache.org/jira/browse/SPARK-15579
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>
> scala> sc.parallelize(1 to 1, 5000).cache().count()
> SparkUI storage page is empty.






[jira] [Created] (SPARK-15579) SparkUI: Storage page is empty even if things are cached

2016-05-26 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15579:
-

 Summary: SparkUI: Storage page is empty even if things are cached
 Key: SPARK-15579
 URL: https://issues.apache.org/jira/browse/SPARK-15579
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 2.0.0
Reporter: Andrew Or









[jira] [Resolved] (SPARK-15552) Remove unnecessary private[sql] methods in SparkSession

2016-05-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15552.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Remove unnecessary private[sql] methods in SparkSession
> ---
>
> Key: SPARK-15552
> URL: https://issues.apache.org/jira/browse/SPARK-15552
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> SparkSession has a list of unnecessary private[sql] methods. These methods 
> cause some trouble because private[sql] doesn't apply in Java. In the cases 
> where they are easy to remove, we can simply remove them.
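
A quick illustration of why private[sql] doesn't help against Java callers (a 
hedged sketch with a made-up class, not the actual SparkSession source): Scala 
package-qualified private members compile to public methods in the bytecode, so 
Java code can still see and call them.

{code}
// Hypothetical Scala class standing in for SparkSession internals.
package org.apache.spark.sql

class Example {
  // Callable only from the org.apache.spark.sql package in Scala, but the
  // compiled bytecode marks the method public, so any Java caller can invoke it.
  private[sql] def internalHelper(): Int = 42
}
{code}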






[jira] [Updated] (SPARK-15576) Add back hive tests blacklisted by SPARK-15539

2016-05-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15576:
--
Assignee: (was: Andrew Or)

> Add back hive tests blacklisted by SPARK-15539
> --
>
> Key: SPARK-15576
> URL: https://issues.apache.org/jira/browse/SPARK-15576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>
> These were removed from HiveCompatibilitySuite. They should be added back to 
> HiveQuerySuite.






[jira] [Updated] (SPARK-15520) SparkSession builder in python should also allow overriding confs of existing sessions

2016-05-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15520:
--
Assignee: Eric Liang

> SparkSession builder in python should also allow overriding confs of existing 
> sessions
> --
>
> Key: SPARK-15520
> URL: https://issues.apache.org/jira/browse/SPARK-15520
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.0.0
>
>
> This is a leftover TODO from the SparkSession clean in this PR: 
> https://github.com/apache/spark/pull/13200
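
For context, a hedged Scala sketch of the behaviour the Python builder should 
mirror: in Scala, configs passed to the builder are applied to an already 
existing session by getOrCreate.

{code}
// Hedged sketch: getOrCreate should apply the given conf to the existing
// session rather than silently ignoring it.
import org.apache.spark.sql.SparkSession

val first = SparkSession.builder().master("local[*]").getOrCreate()
val second = SparkSession.builder()
  .config("spark.sql.shuffle.partitions", "10")
  .getOrCreate()

assert(first eq second)                                          // same session
assert(second.conf.get("spark.sql.shuffle.partitions") == "10")  // conf applied
{code}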






[jira] [Resolved] (SPARK-15520) SparkSession builder in python should also allow overriding confs of existing sessions

2016-05-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15520.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> SparkSession builder in python should also allow overriding confs of existing 
> sessions
> --
>
> Key: SPARK-15520
> URL: https://issues.apache.org/jira/browse/SPARK-15520
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.0.0
>
>
> This is a leftover TODO from the SparkSession clean in this PR: 
> https://github.com/apache/spark/pull/13200






[jira] [Resolved] (SPARK-15539) DROP TABLE should throw exceptions, not logError

2016-05-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15539.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> DROP TABLE should throw exceptions, not logError
> 
>
> Key: SPARK-15539
> URL: https://issues.apache.org/jira/browse/SPARK-15539
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
> Fix For: 2.0.0
>
>
> Same as SPARK-15534 but for DROP TABLE






[jira] [Created] (SPARK-15576) Add back hive tests blacklisted by SPARK-15539

2016-05-26 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15576:
-

 Summary: Add back hive tests blacklisted by SPARK-15539
 Key: SPARK-15576
 URL: https://issues.apache.org/jira/browse/SPARK-15576
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or


These were removed from HiveCompatibilitySuite. They should be added back to 
HiveQuerySuite.






[jira] [Commented] (SPARK-15506) only one notebook can define a UDF; java.sql.SQLException: Another instance of Derby may have already booted the database

2016-05-26 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302359#comment-15302359
 ] 

Andrew Davidson commented on SPARK-15506:
-

Hi Jeff

Here is how I start the notebook server. I believe Spark uses Jupyter:

$SPARK_ROOT/bin/pyspark


Can you tell me where I can find out more about the configuration details? I do 
not think the issue is multiple users. I discovered the bug while running two 
notebooks on my local machine, i.e. I was running both notebooks myself. It 
seems like each notebook server needs its own database?

Kind regards

Andy

P.S. Even in our data center I start the notebook server the same way. I am the 
only data scientist.


> only one notebook can define a UDF; java.sql.SQLException: Another instance 
> of Derby may have already booted the database
> -
>
> Key: SPARK-15506
> URL: https://issues.apache.org/jira/browse/SPARK-15506
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
> Environment: Mac OSX El Captain
> Python 3.4.2
>Reporter: Andrew Davidson
>
> I am using a sqlContext to create dataframes. I noticed that if I open up and 
> run 'notebook a', and 'a' defines a UDF, then I will not be able to open a 
> second notebook that also defines a UDF unless I shut down notebook a first.
> In the second notebook I get a big long stack trace.  The problem seems to be
> Caused by: java.sql.SQLException: Another instance of Derby may have already 
> booted the database 
> /Users/andrewdavidson/workSpace/bigPWSWorkspace/dataScience/notebooks/gnip/metastore_db.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
>   ... 86 more
> Here is the complete stack trace.
> Kind regards
> Andy
> You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt 
> assembly
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
>  16 #fooUDF = udf(lambda arg : "aedwip")
>  17 
> ---> 18 paddedStrUDF = udf(lambda zipInt : str(zipInt).zfill(5))
> /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/functions.py
>  in udf(f, returnType)
>1595 [Row(slen=5), Row(slen=3)]
>1596 """
> -> 1597 return UserDefinedFunction(f, returnType)
>1598 
>1599 blacklist = ['map', 'since', 'ignore_unicode_prefix']
> /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/functions.py
>  in __init__(self, func, returnType, name)
>1556 self.returnType = returnType
>1557 self._broadcast = None
> -> 1558 self._judf = self._create_judf(name)
>1559 
>1560 def _create_judf(self, name):
> /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/functions.py
>  in _create_judf(self, name)
>1567 pickled_command, broadcast_vars, env, includes = 
> _prepare_for_python_RDD(sc, command, self)
>1568 ctx = SQLContext.getOrCreate(sc)
> -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json())
>1570 if name is None:
>1571 name = f.__name__ if hasattr(f, '__name__') else 
> f.__class__.__name__
> /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/context.py
>  in _ssql_ctx(self)
> 681 try:
> 682 if not hasattr(self, '_scala_HiveContext'):
> --> 683 self._scala_HiveContext = self._get_hive_ctx()
> 684 return self._scala_HiveContext
> 685 except Py4JError as e:
> /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/context.py
>  in _get_hive_ctx(self)
> 690 
> 691 def _get_hive_ctx(self):
> --> 692 return self._jvm.HiveContext(self._jsc.sc())
> 693 
> 694 def refreshTable(self, tableName):
> /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1062 answer = self._gateway_client.send_command(command)
>1063 return_value = get_return_value(
> -> 1064 answer, self._gateway_client, None, self._fqn)
>1065 
>1066 for temp_arg in temp_args:
> /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/utils.py
>  

[jira] [Assigned] (SPARK-15536) Disallow TRUNCATE TABLE with external tables and views

2016-05-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reassigned SPARK-15536:
-

Assignee: Andrew Or  (was: Suresh Thalamati)

> Disallow TRUNCATE TABLE with external tables and views
> --
>
> Key: SPARK-15536
> URL: https://issues.apache.org/jira/browse/SPARK-15536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Otherwise we might accidentally delete existing data.






[jira] [Updated] (SPARK-15538) Truncate table does not work on data source table

2016-05-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15538:
--
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> Truncate table does not work on data source table
> -
>
> Key: SPARK-15538
> URL: https://issues.apache.org/jira/browse/SPARK-15538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Suresh Thalamati
>Assignee: Andrew Or
>Priority: Minor
>
> Truncate table does not seem to work on data source tables.
> Repro:
> {code}
> val df = Seq((1 , "john", "CA") ,(2,"Mike", "NY"), (3, "Robert", 
> "CA")).toDF("id", "name", "state")
> df.write.format("parquet").partitionBy("state").saveAsTable("emp")
> scala> sql("truncate table emp") 
> res8: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from emp").show() // FileNotFoundException
> {code} 






[jira] [Updated] (SPARK-15536) Disallow TRUNCATE TABLE with external tables and views

2016-05-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15536:
--
Summary: Disallow TRUNCATE TABLE with external tables and views  (was: 
Disallow TRUNCATE TABLE with external tables)

> Disallow TRUNCATE TABLE with external tables and views
> --
>
> Key: SPARK-15536
> URL: https://issues.apache.org/jira/browse/SPARK-15536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Suresh Thalamati
>
> Otherwise we might accidentally delete existing data.






[jira] [Updated] (SPARK-15538) Truncate table does not work on data source table

2016-05-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15538:
--
Description: 
Truncate table does not seem to work on data source tables.

Repro:
{code}
val df = Seq((1 , "john", "CA") ,(2,"Mike", "NY"), (3, "Robert", 
"CA")).toDF("id", "name", "state")
df.write.format("parquet").partitionBy("state").saveAsTable("emp")

scala> sql("truncate table emp") 
res8: org.apache.spark.sql.DataFrame = []

scala> sql("select * from emp").show() // FileNotFoundException
{code} 

  was:
Truncate table does not seem to work on data source tables. It returns success 
without any error, but the table is not truncated. 

Repro:
{code}
val df = Seq((1 , "john", "CA") ,(2,"Mike", "NY"), (3, "Robert", 
"CA")).toDF("id", "name", "state")
df.write.format("parquet").partitionBy("state").saveAsTable("emp")

scala> sql("truncate table emp") 
res8: org.apache.spark.sql.DataFrame = []

scala> sql("select * from emp").show() // FileNotFoundException
{code} 


> Truncate table does not work on data source table
> -
>
> Key: SPARK-15538
> URL: https://issues.apache.org/jira/browse/SPARK-15538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Suresh Thalamati
>Assignee: Andrew Or
>Priority: Minor
>
> Truncate table does not seem to work on data source tables.
> Repro:
> {code}
> val df = Seq((1 , "john", "CA") ,(2,"Mike", "NY"), (3, "Robert", 
> "CA")).toDF("id", "name", "state")
> df.write.format("parquet").partitionBy("state").saveAsTable("emp")
> scala> sql("truncate table emp") 
> res8: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from emp").show() // FileNotFoundException
> {code} 






[jira] [Updated] (SPARK-15538) Truncate table does not work on data source table

2016-05-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15538:
--
Description: 
Truncate table does not seem to work on data source tables. It returns success 
without any error, but the table is not truncated. 

Repro:
{code}
val df = Seq((1 , "john", "CA") ,(2,"Mike", "NY"), (3, "Robert", 
"CA")).toDF("id", "name", "state")
df.write.format("parquet").partitionBy("state").saveAsTable("emp")

scala> sql("truncate table emp") 
res8: org.apache.spark.sql.DataFrame = []

scala> sql("select * from emp").show() // FileNotFoundException
{code} 

  was:
Truncate table does not seem to work on data source tables. It returns success 
without any error, but the table is not truncated. 

Repro:
{code}
val df = Seq((1 , "john", "CA") ,(2,"Mike", "NY"), (3, "Robert", 
"CA")).toDF("id", "name", "state")
df.write.format("parquet").partitionBy("state").saveAsTable("emp")

scala> sql("truncate table emp") 
res8: org.apache.spark.sql.DataFrame = []

scala> sql("select * from emp").show ;
+---+--+-+
| id|  name|state|
+---+--+-+
|  3|Robert|   CA|
|  1|  john|   CA|
|  2|  Mike|   NY|
+---+--+-+

{code} 

The select should have returned no results. 

By scanning through the code I found that some of the other DDL commands, like 
LOAD DATA and SHOW PARTITIONS, are not allowed for data source tables and they 
raise an error. 

It might be good to throw an error until TRUNCATE TABLE works with data 
source tables as well.
 


> Truncate table does not work on data source table
> -
>
> Key: SPARK-15538
> URL: https://issues.apache.org/jira/browse/SPARK-15538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Suresh Thalamati
>Assignee: Andrew Or
>Priority: Minor
>
> Truncate table does not seem to work on data source tables. It returns 
> success without any error, but the table is not truncated. 
> Repro:
> {code}
> val df = Seq((1 , "john", "CA") ,(2,"Mike", "NY"), (3, "Robert", 
> "CA")).toDF("id", "name", "state")
> df.write.format("parquet").partitionBy("state").saveAsTable("emp")
> scala> sql("truncate table emp") 
> res8: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from emp").show() // FileNotFoundException
> {code} 






[jira] [Updated] (SPARK-15538) Truncate table does not work on data source table

2016-05-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15538:
--
Summary: Truncate table does not work on data source table  (was: Truncate 
table does not work on data source table , and does not raise error either.)

> Truncate table does not work on data source table
> -
>
> Key: SPARK-15538
> URL: https://issues.apache.org/jira/browse/SPARK-15538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Suresh Thalamati
>Assignee: Suresh Thalamati
>Priority: Minor
>
> Truncate table does not seem to work on data source tables. It returns 
> success without any error, but the table is not truncated. 
> Repro:
> {code}
> val df = Seq((1 , "john", "CA") ,(2,"Mike", "NY"), (3, "Robert", 
> "CA")).toDF("id", "name", "state")
> df.write.format("parquet").partitionBy("state").saveAsTable("emp")
> scala> sql("truncate table emp") 
> res8: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from emp").show ;
> +---+--+-+
> | id|  name|state|
> +---+--+-+
> |  3|Robert|   CA|
> |  1|  john|   CA|
> |  2|  Mike|   NY|
> +---+--+-+
> {code} 
> The select should have returned no results. 
> By scanning through the code I found that some of the other DDL commands, like 
> LOAD DATA and SHOW PARTITIONS, are not allowed for data source tables and 
> they raise an error. 
> It might be good to throw an error until TRUNCATE TABLE works with data 
> source tables as well.
>  






[jira] [Assigned] (SPARK-15538) Truncate table does not work on data source table

2016-05-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reassigned SPARK-15538:
-

Assignee: Andrew Or  (was: Suresh Thalamati)

> Truncate table does not work on data source table
> -
>
> Key: SPARK-15538
> URL: https://issues.apache.org/jira/browse/SPARK-15538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Suresh Thalamati
>Assignee: Andrew Or
>Priority: Minor
>
> Truncate table does not seem to work on data source tables. It returns 
> success without any error, but the table is not truncated. 
> Repro:
> {code}
> val df = Seq((1 , "john", "CA") ,(2,"Mike", "NY"), (3, "Robert", 
> "CA")).toDF("id", "name", "state")
> df.write.format("parquet").partitionBy("state").saveAsTable("emp")
> scala> sql("truncate table emp") 
> res8: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from emp").show ;
> +---+--+-+
> | id|  name|state|
> +---+--+-+
> |  3|Robert|   CA|
> |  1|  john|   CA|
> |  2|  Mike|   NY|
> +---+--+-+
> {code} 
> The select should have returned no results. 
> By scanning through the code I found that some of the other DDL commands, like 
> LOAD DATA and SHOW PARTITIONS, are not allowed for data source tables and 
> they raise an error. 
> It might be good to throw an error until TRUNCATE TABLE works with data 
> source tables as well.
>  






[jira] [Created] (SPARK-15539) DROP TABLE should throw exceptions, not logError

2016-05-25 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15539:
-

 Summary: DROP TABLE should throw exceptions, not logError
 Key: SPARK-15539
 URL: https://issues.apache.org/jira/browse/SPARK-15539
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Minor


Same as SPARK-15534 but for DROP TABLE






[jira] [Updated] (SPARK-15538) Truncate table does not work on data source table , and does not raise error either.

2016-05-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15538:
--
Assignee: Suresh Thalamati

> Truncate table does not work on data source table , and does not raise error 
> either.
> 
>
> Key: SPARK-15538
> URL: https://issues.apache.org/jira/browse/SPARK-15538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Suresh Thalamati
>Assignee: Suresh Thalamati
>Priority: Minor
>
> Truncate table does not seem to work on data source tables. It returns 
> success without any error, but the table is not truncated. 
> Repro:
> {code}
> val df = Seq((1 , "john", "CA") ,(2,"Mike", "NY"), (3, "Robert", 
> "CA")).toDF("id", "name", "state")
> df.write.format("parquet").partitionBy("state").saveAsTable("emp")
> scala> sql("truncate table emp") 
> res8: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from emp").show ;
> +---+--+-+
> | id|  name|state|
> +---+--+-+
> |  3|Robert|   CA|
> |  1|  john|   CA|
> |  2|  Mike|   NY|
> +---+--+-+
> {code} 
> The select should have returned no results. 
> By scanning through the code I found that some of the other DDL commands, like 
> LOAD DATA and SHOW PARTITIONS, are not allowed for data source tables and 
> they raise an error. 
> It might be good to throw an error until TRUNCATE TABLE works with data 
> source tables as well.
>  






[jira] [Updated] (SPARK-15535) Remove code for TRUNCATE TABLE ... COLUMN

2016-05-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15535:
--
Priority: Minor  (was: Major)

> Remove code for TRUNCATE TABLE ... COLUMN
> -
>
> Key: SPARK-15535
> URL: https://issues.apache.org/jira/browse/SPARK-15535
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
>
> This was never supported in the first place. Also Hive doesn't support it: 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL






[jira] [Updated] (SPARK-15534) TRUNCATE TABLE should throw exceptions, not logError

2016-05-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15534:
--
Priority: Minor  (was: Major)

> TRUNCATE TABLE should throw exceptions, not logError
> 
>
> Key: SPARK-15534
> URL: https://issues.apache.org/jira/browse/SPARK-15534
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
>
> If the table to truncate doesn't exist, throw an exception!






[jira] [Updated] (SPARK-15536) Disallow TRUNCATE TABLE with external tables

2016-05-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15536:
--
Assignee: Suresh Thalamati  (was: Andrew Or)

> Disallow TRUNCATE TABLE with external tables
> 
>
> Key: SPARK-15536
> URL: https://issues.apache.org/jira/browse/SPARK-15536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Suresh Thalamati
>
> Otherwise we might accidentally delete existing data.






[jira] [Created] (SPARK-15536) Disallow TRUNCATE TABLE with external tables

2016-05-25 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15536:
-

 Summary: Disallow TRUNCATE TABLE with external tables
 Key: SPARK-15536
 URL: https://issues.apache.org/jira/browse/SPARK-15536
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or


Otherwise we might accidentally delete existing data.
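
A hedged sketch of the risk (the table name and path are illustrative, not from 
this report): an external table's LOCATION points at data Spark does not own, so 
truncating it would delete files belonging to someone else.

{code}
// Illustrative only: requires Hive support; the path is made up.
scala> sql("CREATE EXTERNAL TABLE ext_logs (line STRING) LOCATION '/data/shared/logs'")

// After this change, the following should be rejected instead of wiping the
// files under /data/shared/logs:
scala> sql("TRUNCATE TABLE ext_logs")
{code}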






[jira] [Created] (SPARK-15535) Remove code for TRUNCATE TABLE ... COLUMN

2016-05-25 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15535:
-

 Summary: Remove code for TRUNCATE TABLE ... COLUMN
 Key: SPARK-15535
 URL: https://issues.apache.org/jira/browse/SPARK-15535
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or


This was never supported in the first place. Also Hive doesn't support it: 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL






[jira] [Created] (SPARK-15534) TRUNCATE TABLE should throw exceptions, not logError

2016-05-25 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15534:
-

 Summary: TRUNCATE TABLE should throw exceptions, not logError
 Key: SPARK-15534
 URL: https://issues.apache.org/jira/browse/SPARK-15534
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or


If the table to truncate doesn't exist, throw an exception!
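
A hedged sketch of the desired behaviour (the table name is illustrative):

{code}
// Desired: truncating a table that does not exist should surface an
// AnalysisException to the caller instead of only logging an error and
// returning an empty result.
scala> sql("TRUNCATE TABLE no_such_table")
// expected: org.apache.spark.sql.AnalysisException (exact message may vary)
{code}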






[jira] [Resolved] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2016-05-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15345.
---
Resolution: Fixed
  Assignee: Jeff Zhang  (was: Reynold Xin)

> SparkSession's conf doesn't take effect when there's already an existing 
> SparkContext
> -
>
> Key: SPARK-15345
> URL: https://issues.apache.org/jira/browse/SPARK-15345
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Piotr Milanowski
>Assignee: Jeff Zhang
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I am working with branch-2.0; Spark is compiled with Hive support (-Phive and 
> -Phive-thriftserver).
> I am trying to access databases using this snippet:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> hc.sql("show databases").collect()
> [Row(result='default')]
> {code}
> This means that spark doesn't find any databases specified in configuration.
> Using the same configuration (i.e. hive-site.xml and core-site.xml) in spark 
> 1.6, and launching above snippet, I can print out existing databases.
> When run in DEBUG mode this is what spark (2.0) prints out:
> {code}
> 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases
> 16/05/16 12:17:47 DEBUG SimpleAnalyzer: 
> === Result of Batch Resolution ===
> !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, 
> string])) null else input[0, string].toString, 
> StructField(result,StringType,false)), result#2) AS #3]   Project 
> [createexternalrow(if (isnull(result#2)) null else result#2.toString, 
> StructField(result,StringType,false)) AS #3]
>  +- LocalRelation [result#2]  
>   
>  +- LocalRelation [result#2]
> 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  private final 
> org.apache.spark.sql.types.StructType 
> org.apache.spark.sql.Dataset$$anonfun$53.structType$1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  is now cleaned +++
> 

[jira] [Updated] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2016-05-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15345:
--
Assignee: Reynold Xin  (was: Jeff Zhang)

> SparkSession's conf doesn't take effect when there's already an existing 
> SparkContext
> -
>
> Key: SPARK-15345
> URL: https://issues.apache.org/jira/browse/SPARK-15345
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Piotr Milanowski
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I am working with branch-2.0; Spark is compiled with Hive support (-Phive and 
> -Phive-thriftserver).
> I am trying to access databases using this snippet:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> hc.sql("show databases").collect()
> [Row(result='default')]
> {code}
> This means that spark doesn't find any databases specified in configuration.
> Using the same configuration (i.e. hive-site.xml and core-site.xml) in spark 
> 1.6, and launching above snippet, I can print out existing databases.
> When run in DEBUG mode this is what spark (2.0) prints out:
> {code}
> 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases
> 16/05/16 12:17:47 DEBUG SimpleAnalyzer: 
> === Result of Batch Resolution ===
> !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, 
> string])) null else input[0, string].toString, 
> StructField(result,StringType,false)), result#2) AS #3]   Project 
> [createexternalrow(if (isnull(result#2)) null else result#2.toString, 
> StructField(result,StringType,false)) AS #3]
>  +- LocalRelation [result#2]  
>   
>  +- LocalRelation [result#2]
> 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  private final 
> org.apache.spark.sql.types.StructType 
> org.apache.spark.sql.Dataset$$anonfun$53.structType$1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  is now cleaned +++
> 16/05/16 12:17:47 DEBUG 

[jira] [Resolved] (SPARK-15511) Dropping data source table succeeds but throws exception

2016-05-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15511.
---
Resolution: Not A Problem
  Assignee: Andrew Or

If you run into this issue again, just delete $SPARK_HOME/metastore_db

> Dropping data source table succeeds but throws exception
> 
>
> Key: SPARK-15511
> URL: https://issues.apache.org/jira/browse/SPARK-15511
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> If the catalog is backed by Hive:
> {code}
> scala> sql("CREATE TABLE boxes (width INT, length INT, height INT) USING CSV")
> {code}
> {code}
> scala> sql("DROP TABLE boxes")
> 16/05/24 13:30:50 WARN DropTableCommand: 
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/user/hive/warehouse/boxes;
> com.google.common.util.concurrent.UncheckedExecutionException: 
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/user/hive/warehouse/boxes;
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:170)
> ...
> Caused by: org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/user/hive/warehouse/boxes;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:317)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:306)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anon$1.load(HiveMetastoreCatalog.scala:133)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$$anon$1.load(HiveMetastoreCatalog.scala:69)
> {code}






[jira] [Updated] (SPARK-15511) Dropping data source table succeeds but throws exception

2016-05-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15511:
--
Description: 
If the catalog is backed by Hive:

{code}
scala> sql("CREATE TABLE boxes (width INT, length INT, height INT) USING CSV")
{code}

{code}
scala> sql("DROP TABLE boxes")
16/05/24 13:30:50 WARN DropTableCommand: 
org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/user/hive/warehouse/boxes;
com.google.common.util.concurrent.UncheckedExecutionException: 
org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/user/hive/warehouse/boxes;
at 
com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882)
at 
com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:170)
...
Caused by: org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/user/hive/warehouse/boxes;
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:317)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:306)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog$$anon$1.load(HiveMetastoreCatalog.scala:133)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog$$anon$1.load(HiveMetastoreCatalog.scala:69)
{code}

  was:
{code}
scala> sql("CREATE TABLE boxes (width INT, length INT, height INT) USING CSV")
{code}

{code}
scala> sql("DROP TABLE boxes")
16/05/24 13:30:50 WARN DropTableCommand: 
org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/user/hive/warehouse/boxes;
com.google.common.util.concurrent.UncheckedExecutionException: 
org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/user/hive/warehouse/boxes;
at 
com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882)
at 
com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:170)
...
Caused by: org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/user/hive/warehouse/boxes;
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:317)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:306)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog$$anon$1.load(HiveMetastoreCatalog.scala:133)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog$$anon$1.load(HiveMetastoreCatalog.scala:69)
{code}


> Dropping data source table succeeds but throws exception
> 
>
> Key: SPARK-15511
> URL: https://issues.apache.org/jira/browse/SPARK-15511
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>
> If the catalog is backed by Hive:
> {code}
> scala> sql("CREATE TABLE boxes (width INT, length INT, height INT) USING CSV")
> {code}
> {code}
> scala> sql("DROP TABLE boxes")
> 16/05/24 13:30:50 WARN DropTableCommand: 
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/user/hive/warehouse/boxes;
> com.google.common.util.concurrent.UncheckedExecutionException: 
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/user/hive/warehouse/boxes;
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:170)
> 

[jira] [Created] (SPARK-15511) Dropping data source table succeeds but throws exception

2016-05-24 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15511:
-

 Summary: Dropping data source table succeeds but throws exception
 Key: SPARK-15511
 URL: https://issues.apache.org/jira/browse/SPARK-15511
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Or


{code}
scala> sql("CREATE TABLE boxes (width INT, length INT, height INT) USING CSV")
{code}

{code}
scala> sql("DROP TABLE boxes")
16/05/24 13:30:50 WARN DropTableCommand: 
org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/user/hive/warehouse/boxes;
com.google.common.util.concurrent.UncheckedExecutionException: 
org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/user/hive/warehouse/boxes;
at 
com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882)
at 
com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:170)
...
Caused by: org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/user/hive/warehouse/boxes;
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:317)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:306)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog$$anon$1.load(HiveMetastoreCatalog.scala:133)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog$$anon$1.load(HiveMetastoreCatalog.scala:69)
{code}






[jira] [Resolved] (SPARK-15388) spark sql "CREATE FUNCTION" throws exception with hive 1.2.1

2016-05-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15388.
---
  Resolution: Fixed
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> spark sql "CREATE FUNCTION" throws exception with hive 1.2.1
> 
>
> Key: SPARK-15388
> URL: https://issues.apache.org/jira/browse/SPARK-15388
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yang Wang
>Assignee: Yang Wang
> Fix For: 2.0.0
>
>
> spark.sql("CREATE FUNCTION MY_FUNCTION_1 AS 
> 'com.haizhi.bdp.udf.UDFGetGeoCode'") throws 
> org.apache.spark.sql.AnalysisException. 
> I was using Hive version 1.2.1.
> Full stack trace is as follows:
>  Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> MetaException(message:NoSuchObjectException(message:Function 
> bdp.GET_GEO_CODE does not exist));
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:71)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.functionExists(HiveExternalCatalog.scala:323)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.functionExists(SessionCatalog.scala:712)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createFunction(SessionCatalog.scala:663)
>  at 
> org.apache.spark.sql.execution.command.CreateFunction.run(functions.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
>  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:187)
>  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:168)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541)






[jira] [Updated] (SPARK-15388) spark sql "CREATE FUNCTION" throws exception with hive 1.2.1

2016-05-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15388:
--
Assignee: Yang Wang

> spark sql "CREATE FUNCTION" throws exception with hive 1.2.1
> 
>
> Key: SPARK-15388
> URL: https://issues.apache.org/jira/browse/SPARK-15388
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yang Wang
>Assignee: Yang Wang
> Fix For: 2.0.0
>
>
> spark.sql("CREATE FUNCTION MY_FUNCTION_1 AS 
> 'com.haizhi.bdp.udf.UDFGetGeoCode'") throws 
> org.apache.spark.sql.AnalysisException. 
> I was using Hive version 1.2.1.
> Full stack trace is as follows:
>  Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> MetaException(message:NoSuchObjectException(message:Function 
> bdp.GET_GEO_CODE does not exist));
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:71)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.functionExists(HiveExternalCatalog.scala:323)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.functionExists(SessionCatalog.scala:712)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createFunction(SessionCatalog.scala:663)
>  at 
> org.apache.spark.sql.execution.command.CreateFunction.run(functions.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
>  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:187)
>  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:168)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541)






[jira] [Created] (SPARK-15506) only one notebook can define a UDF; java.sql.SQLException: Another instance of Derby may have already booted the database

2016-05-24 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-15506:
---

 Summary: only one notebook can define a UDF; 
java.sql.SQLException: Another instance of Derby may have already booted the 
database
 Key: SPARK-15506
 URL: https://issues.apache.org/jira/browse/SPARK-15506
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.1
 Environment: Mac OSX El Captain
Python 3.4.2

Reporter: Andrew Davidson


I am using a sqlContext to create dataframes. I noticed that if I open up and 
run 'notebook a', and 'a' defines a UDF, then I will not be able to open a 
second notebook that also defines a UDF unless I shut down notebook a first.

In the second notebook I get a big long stack trace.  The problem seems to be

Caused by: java.sql.SQLException: Another instance of Derby may have already 
booted the database 
/Users/andrewdavidson/workSpace/bigPWSWorkspace/dataScience/notebooks/gnip/metastore_db.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
 Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
Source)
... 86 more



Here is the complete stack trace

Kind regards

Andy

You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt 
assembly
---
Py4JJavaError Traceback (most recent call last)
 in ()
 16 #fooUDF = udf(lambda arg : "aedwip")
 17 
---> 18 paddedStrUDF = udf(lambda zipInt : str(zipInt).zfill(5))

/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/functions.py
 in udf(f, returnType)
   1595 [Row(slen=5), Row(slen=3)]
   1596 """
-> 1597 return UserDefinedFunction(f, returnType)
   1598 
   1599 blacklist = ['map', 'since', 'ignore_unicode_prefix']

/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/functions.py
 in __init__(self, func, returnType, name)
   1556 self.returnType = returnType
   1557 self._broadcast = None
-> 1558 self._judf = self._create_judf(name)
   1559 
   1560 def _create_judf(self, name):

/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/functions.py
 in _create_judf(self, name)
   1567 pickled_command, broadcast_vars, env, includes = 
_prepare_for_python_RDD(sc, command, self)
   1568 ctx = SQLContext.getOrCreate(sc)
-> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json())
   1570 if name is None:
   1571 name = f.__name__ if hasattr(f, '__name__') else 
f.__class__.__name__

/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/context.py
 in _ssql_ctx(self)
681 try:
682 if not hasattr(self, '_scala_HiveContext'):
--> 683 self._scala_HiveContext = self._get_hive_ctx()
684 return self._scala_HiveContext
685 except Py4JError as e:

/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/context.py
 in _get_hive_ctx(self)
690 
691 def _get_hive_ctx(self):
--> 692 return self._jvm.HiveContext(self._jsc.sc())
693 
694 def refreshTable(self, tableName):

/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
 in __call__(self, *args)
   1062 answer = self._gateway_client.send_command(command)
   1063 return_value = get_return_value(
-> 1064 answer, self._gateway_client, None, self._fqn)
   1065 
   1066 for temp_arg in temp_args:

/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/utils.py
 in deco(*a, **kw)
 43 def deco(*a, **kw):
 44 try:
---> 45 return f(*a, **kw)
 46 except py4j.protocol.Py4JJavaError as e:
 47 s = e.java_exception.toString()

/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(

Py4JJavaError: An error occurred while calling 
None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at 

[jira] [Commented] (SPARK-15450) Clean up SparkSession builder for python

2016-05-24 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297800#comment-15297800
 ] 

Andrew Or commented on SPARK-15450:
---

Actually I already have some ideas for this one.

> Clean up SparkSession builder for python
> 
>
> Key: SPARK-15450
> URL: https://issues.apache.org/jira/browse/SPARK-15450
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> This is the sister JIRA for SPARK-15075. Today we use 
> `SQLContext.getOrCreate` in our builder. Instead we should just have a real 
> `SparkSession.getOrCreate` and use that in our builder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15464) Replace SQLContext and SparkContext with SparkSession using builder pattern in python testsuites

2016-05-23 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15464.
---
  Resolution: Fixed
Assignee: Weichen Xu
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Replace SQLContext and SparkContext with SparkSession using builder pattern 
> in python testsuites
> 
>
> Key: SPARK-15464
> URL: https://issues.apache.org/jira/browse/SPARK-15464
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
>  Labels: test
> Fix For: 2.0.0
>
>
> In the Python scripts, several places still use SQLContext and create a 
> SparkContext directly; these should be replaced with SparkSession via 
> SparkSession.builder.
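
For illustration, a minimal before/after sketch of the intended change (names 
and values are illustrative only):

{code}
# Old pattern (to be replaced): create the contexts directly.
#   sc = SparkContext("local[2]", "some-test")
#   sqlContext = SQLContext(sc)

# New pattern: everything goes through the SparkSession builder.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[2]") \
    .appName("some-test") \
    .getOrCreate()
sc = spark.sparkContext   # underlying SparkContext, if a test still needs it

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
assert df.count() == 2
spark.stop()
{code}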



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15311) Disallow DML on Non-temporary Tables when Using In-Memory Catalog

2016-05-23 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15311.
---
  Resolution: Fixed
Assignee: Xiao Li
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Disallow DML on Non-temporary Tables when Using In-Memory Catalog
> -
>
> Key: SPARK-15311
> URL: https://issues.apache.org/jira/browse/SPARK-15311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> So far, when using the in-memory catalog, we allow DDL operations on 
> non-temporary tables. However, the corresponding DML operations are not 
> supported, so we need to raise clear exceptions in this case.
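
For illustration, a hedged PySpark sketch of the intended behaviour, assuming a 
session that uses the in-memory catalog (the DDL/DML pair mirrors the trace in 
SPARK-15417; the exact exception type is not pinned down here):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("inmemory-dml").getOrCreate()

spark.sql("CREATE TABLE src (key INT, value STRING)")   # DDL is allowed today
try:
    # DML on a non-temporary table should be rejected up front with a clear
    # exception instead of failing deep inside the in-memory catalog.
    spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' "
              "INTO TABLE src")
except Exception as e:
    print("rejected as expected:", e)
spark.stop()
{code}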



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15488) Possible Accumulator bug causing OneVsRestSuite to be flaky

2016-05-23 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15488.
---
  Resolution: Fixed
Assignee: Liang-Chi Hsieh
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Possible Accumulator bug causing OneVsRestSuite to be flaky
> ---
>
> Key: SPARK-15488
> URL: https://issues.apache.org/jira/browse/SPARK-15488
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core
>Affects Versions: 2.0.0
> Environment: Jenkins: branch-2.0, maven build, Hadoop 2.6
>Reporter: Joseph K. Bradley
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> OneVsRestSuite has been slightly flaky recently. The failure happens in the 
> use of {{Range.par}}, which executes concurrent jobs that use the same 
> DataFrame. This sometimes causes failures with 
> {{java.util.ConcurrentModificationException}}.
> It appears the failure is from {{InMemoryRelation.batchStats}} being 
> accessed.  Since that is an instance of {{Accumulable}}, I'm guessing the bug 
> is from recent Accumulator changes.
> Stack trace from this test run.
> * links: [https://spark-tests.appspot.com/test-logs/125719479] and 
> [https://spark-tests.appspot.com/builds/spark-master-test-maven-hadoop-2.6/993]
> {code}
>   java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
>   at java.util.ArrayList$Itr.next(ArrayList.java:851)
>   at 
> java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation.computeSizeInBytes(InMemoryTableScanExec.scala:90)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation.statistics(InMemoryTableScanExec.scala:113)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation.statisticsToBePropagated(InMemoryTableScanExec.scala:97)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation.withOutput(InMemoryTableScanExec.scala:191)
>   at 
> org.apache.spark.sql.execution.CacheManager$$anonfun$useCachedData$1$$anonfun$applyOrElse$1.apply(CacheManager.scala:144)
>   at 
> org.apache.spark.sql.execution.CacheManager$$anonfun$useCachedData$1$$anonfun$applyOrElse$1.apply(CacheManager.scala:144)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.execution.CacheManager$$anonfun$useCachedData$1.applyOrElse(CacheManager.scala:144)
>   at 
> org.apache.spark.sql.execution.CacheManager$$anonfun$useCachedData$1.applyOrElse(CacheManager.scala:141)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:307)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
> 

[jira] [Resolved] (SPARK-15279) Disallow ROW FORMAT and STORED AS (parquet | orc | avro etc.)

2016-05-23 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15279.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Disallow ROW FORMAT and STORED AS (parquet | orc | avro etc.)
> -
>
> Key: SPARK-15279
> URL: https://issues.apache.org/jira/browse/SPARK-15279
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> They are both potentially conflicting ways that allow you to specify the 
> SerDe. Unfortunately, we can't just get rid of ROW FORMAT because it may be 
> used with TEXTFILE or RCFILE. For other file formats, we should fail fast 
> wherever possible.
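
For illustration, a hedged sketch of the two kinds of DDL involved, run through 
a Hive-enabled SparkSession assumed to be bound to {{spark}} (the SerDe class 
name is made up):

{code}
# ROW FORMAT together with TEXTFILE (or RCFILE) stays legal:
spark.sql("""
    CREATE TABLE t_text (key INT, value STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
""")

# ROW FORMAT SERDE combined with a self-describing format such as PARQUET is
# the conflicting case that should now fail fast:
try:
    spark.sql("""
        CREATE TABLE t_parquet (key INT, value STRING)
        ROW FORMAT SERDE 'org.example.SomeSerDe'
        STORED AS PARQUET
    """)
except Exception as e:
    print("rejected:", e)
{code}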



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15397) 'locate' UDF got different result with boundary value case compared to Hive engine

2016-05-23 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15397:
--
Summary: 'locate' UDF got different result with boundary value case 
compared to Hive engine  (was: [Spark][SQL] 'locate' UDF got different result 
with boundary value case compared to Hive engine)

> 'locate' UDF got different result with boundary value case compared to Hive 
> engine
> --
>
> Key: SPARK-15397
> URL: https://issues.apache.org/jira/browse/SPARK-15397
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 2.0.0
>Reporter: Yi Zhou
>
> Spark SQL:
> select locate("abc", "abc", 1);
> 0
> Hive:
> select locate("abc", "abc", 1);
> 1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15397) 'locate' UDF got different result with boundary value case compared to Hive engine

2016-05-23 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15397:
--
Assignee: Adrian Wang

> 'locate' UDF got different result with boundary value case compared to Hive 
> engine
> --
>
> Key: SPARK-15397
> URL: https://issues.apache.org/jira/browse/SPARK-15397
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 2.0.0
>Reporter: Yi Zhou
>Assignee: Adrian Wang
>
> Spark SQL:
> select locate("abc", "abc", 1);
> 0
> Hive:
> select locate("abc", "abc", 1);
> 1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15477) HiveContext is private[hive] and not accessible to users.

2016-05-23 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15477.
---
Resolution: Not A Bug

> HiveContext is private[hive] and not accessible to users. 
> --
>
> Key: SPARK-15477
> URL: https://issues.apache.org/jira/browse/SPARK-15477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Doug Balog
>
> In 2.0, org.apache.spark.sql.hive.HiveContext was marked deprecated but should 
> still be accessible from user programs. It is not, since it is marked as 
> `private[hive]`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15477) HiveContext is private[hive] and not accessible to users.

2016-05-23 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296763#comment-15296763
 ] 

Andrew Or commented on SPARK-15477:
---

What part of the code makes you think it's private[hive]?
https://github.com/apache/spark/blob/branch-2.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala

> HiveContext is private[hive] and not accessible to users. 
> --
>
> Key: SPARK-15477
> URL: https://issues.apache.org/jira/browse/SPARK-15477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Doug Balog
>
> In 2.0, org.apache.spark.sql.hive.HiveContext was marked deprecated but should 
> still be accessible from user programs. It is not, since it is marked as 
> `private[hive]`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15456) PySpark Shell fails to create SparkContext if HiveConf not found

2016-05-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15456.
---
  Resolution: Fixed
Assignee: Bryan Cutler
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> PySpark Shell fails to create SparkContext if HiveConf not found
> 
>
> Key: SPARK-15456
> URL: https://issues.apache.org/jira/browse/SPARK-15456
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.0.0
>
>
> When starting the PySpark shell, if HiveConf is not available, it will fall 
> back to creating a SparkSession from a SparkContext. This is attempted with 
> the variable {{sc}}, which hasn't been initialized yet.
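
For illustration, a hedged sketch of the initialization-order issue (heavily 
simplified; this is not the actual shell.py code):

{code}
from pyspark import SparkContext
from pyspark.sql import SparkSession

# The fallback branch needs `sc`, so the SparkContext has to exist before the
# Hive probe runs. In the broken version `sc` was only assigned later, so the
# fallback raised a NameError instead of building a plain session.
sc = SparkContext("local[1]", "shell-fallback-sketch")
try:
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
except Exception:
    spark = SparkSession(sc)   # plain SparkSession without Hive support
{code}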



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15450) Clean up SparkSession builder for python

2016-05-20 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15450:
-

 Summary: Clean up SparkSession builder for python
 Key: SPARK-15450
 URL: https://issues.apache.org/jira/browse/SPARK-15450
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or


This is the sister JIRA for SPARK-15075. Today we use `SQLContext.getOrCreate` 
in our builder. Instead we should just have a real `SparkSession.getOrCreate` 
and use that in our builder.
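
For illustration, a hedged sketch of the behaviour the cleaned-up builder should 
expose (the config key is just an example; this is not the final 
implementation):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("builder-cleanup-sketch") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# A second call should go through a real SparkSession.getOrCreate-style entry
# point and hand back the same session, instead of being routed through
# SQLContext.getOrCreate.
same = SparkSession.builder.getOrCreate()
assert same is spark
{code}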



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2016-05-20 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293899#comment-15293899
 ] 

Andrew Or commented on SPARK-15345:
---

The python part should be resolved by SPARK-15417, 
https://github.com/apache/spark/pull/13203

> SparkSession's conf doesn't take effect when there's already an existing 
> SparkContext
> -
>
> Key: SPARK-15345
> URL: https://issues.apache.org/jira/browse/SPARK-15345
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Piotr Milanowski
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I am working with branch-2.0; Spark is compiled with Hive support (-Phive and 
> -Phive-thriftserver).
> I am trying to access databases using this snippet:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> hc.sql("show databases").collect()
> [Row(result='default')]
> {code}
> This means that spark doesn't find any databases specified in configuration.
> Using the same configuration (i.e. hive-site.xml and core-site.xml) in spark 
> 1.6, and launching above snippet, I can print out existing databases.
> When run in DEBUG mode this is what spark (2.0) prints out:
> {code}
> 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases
> 16/05/16 12:17:47 DEBUG SimpleAnalyzer: 
> === Result of Batch Resolution ===
> !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, 
> string])) null else input[0, string].toString, 
> StructField(result,StringType,false)), result#2) AS #3]   Project 
> [createexternalrow(if (isnull(result#2)) null else result#2.toString, 
> StructField(result,StringType,false)) AS #3]
>  +- LocalRelation [result#2]  
>   
>  +- LocalRelation [result#2]
> 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  private final 
> org.apache.spark.sql.types.StructType 
> org.apache.spark.sql.Dataset$$anonfun$53.structType$1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> 

[jira] [Updated] (SPARK-15417) Failed to enable hive support in PySpark shell

2016-05-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15417:
--
Summary: Failed to enable hive support in PySpark shell  (was: Failed to 
enable HiveSupport in PySpark)

> Failed to enable hive support in PySpark shell
> --
>
> Key: SPARK-15417
> URL: https://issues.apache.org/jira/browse/SPARK-15417
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Unable to use Hive meta-store in pyspark shell. Tried both HiveContext and 
> SparkSession. Both failed. It always uses in-memory catalog.
> Method 1: Using SparkSession
> {noformat}
> >>> from pyspark.sql import SparkSession
> >>> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> >>> spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> DataFrame[]
> >>> spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' 
> >>> INTO TABLE src")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py", 
> line 494, in sql
> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 
> 57, in deco
> return f(*a, **kw)
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.
> : java.lang.UnsupportedOperationException: loadTable is not implemented
> at 
> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.loadTable(InMemoryCatalog.scala:297)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:280)
> at org.apache.spark.sql.execution.command.LoadData.run(tables.scala:263)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> at org.apache.spark.sql.Dataset.(Dataset.scala:187)
> at org.apache.spark.sql.Dataset.(Dataset.scala:168)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Method 2: Using HiveContext: 
> {noformat}
> >>> from pyspark.sql import HiveContext
> >>> sqlContext = HiveContext(sc)
> >>> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> DataFrame[]
> >>> sqlContext.sql("LOAD DATA LOCAL INPATH 
> >>> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/context.py", 
> line 346, in sql
> return self.sparkSession.sql(sqlQuery)
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py", 
> line 494, in sql
> return DataFrame(self._jsparkSession.sql(sqlQuery), 

[jira] [Resolved] (SPARK-15417) Failed to enable HiveSupport in PySpark

2016-05-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15417.
---
  Resolution: Fixed
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Failed to enable HiveSupport in PySpark
> ---
>
> Key: SPARK-15417
> URL: https://issues.apache.org/jira/browse/SPARK-15417
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Unable to use Hive meta-store in pyspark shell. Tried both HiveContext and 
> SparkSession. Both failed. It always uses in-memory catalog.
> Method 1: Using SparkSession
> {noformat}
> >>> from pyspark.sql import SparkSession
> >>> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> >>> spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> DataFrame[]
> >>> spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' 
> >>> INTO TABLE src")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py", 
> line 494, in sql
> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 
> 57, in deco
> return f(*a, **kw)
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.
> : java.lang.UnsupportedOperationException: loadTable is not implemented
> at 
> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.loadTable(InMemoryCatalog.scala:297)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:280)
> at org.apache.spark.sql.execution.command.LoadData.run(tables.scala:263)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> at org.apache.spark.sql.Dataset.(Dataset.scala:187)
> at org.apache.spark.sql.Dataset.(Dataset.scala:168)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Method 2: Using HiveContext: 
> {noformat}
> >>> from pyspark.sql import HiveContext
> >>> sqlContext = HiveContext(sc)
> >>> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> DataFrame[]
> >>> sqlContext.sql("LOAD DATA LOCAL INPATH 
> >>> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/context.py", 
> line 346, in sql
> return self.sparkSession.sql(sqlQuery)
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py", 
> line 494, in sql
> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>   File 
> 

[jira] [Resolved] (SPARK-15421) Table and Database property values need to be validated

2016-05-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15421.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Table and Database property values need to be validated
> ---
>
> Key: SPARK-15421
> URL: https://issues.apache.org/jira/browse/SPARK-15421
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> When we parse DDLs involving table or database properties, we need to 
> validate the values.
> E.g. if we alter a database's property without providing a value:
> {code}
> ALTER DATABASE my_db SET DBPROPERTIES('some_key')
> {code}
> Then we'll ignore it with Hive, but override the property with the in-memory 
> catalog. Inconsistencies like these arise because we don't validate the 
> property values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15421) Table and Database property values need to be validated

2016-05-19 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15421:
-

 Summary: Table and Database property values need to be validated
 Key: SPARK-15421
 URL: https://issues.apache.org/jira/browse/SPARK-15421
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or


When we parse DDLs involving table or database properties, we need to validate 
the values.

E.g. if we alter a database's property without providing a value:
{code}
ALTER DATABASE my_db SET DBPROPERTIES('some_key')
{code}

Then we'll ignore it with Hive, but override the property with the in-memory 
catalog. Inconsistencies like these arise because we don't validate the 
property values.
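
For illustration, a small standalone sketch of the kind of validation meant here 
(plain Python, not the actual parser code):

{code}
def validate_properties(props):
    """Reject property entries parsed from DBPROPERTIES / TBLPROPERTIES that
    carry no value, instead of silently ignoring or overriding them."""
    for key, value in props.items():
        if value is None or value == "":
            raise ValueError(
                "Property '%s' was given without a value; a value is required" % key)
    return props

# A key supplied without a value, as in SET DBPROPERTIES('some_key'), fails:
try:
    validate_properties({"some_key": None})
except ValueError as e:
    print(e)
{code}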



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15417) Failed to enable HiveSupport in PySpark

2016-05-19 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292159#comment-15292159
 ] 

Andrew Or commented on SPARK-15417:
---

Good catch, I have a patch to fix this.

> Failed to enable HiveSupport in PySpark
> ---
>
> Key: SPARK-15417
> URL: https://issues.apache.org/jira/browse/SPARK-15417
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Priority: Blocker
>
> Unable to use Hive meta-store in pyspark shell. Tried both HiveContext and 
> SparkSession. Both failed. It always uses in-memory catalog.
> Method 1: Using SparkSession
> {noformat}
> >>> from pyspark.sql import SparkSession
> >>> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> >>> spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> DataFrame[]
> >>> spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' 
> >>> INTO TABLE src")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py", 
> line 494, in sql
> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 
> 57, in deco
> return f(*a, **kw)
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.
> : java.lang.UnsupportedOperationException: loadTable is not implemented
> at 
> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.loadTable(InMemoryCatalog.scala:297)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:280)
> at org.apache.spark.sql.execution.command.LoadData.run(tables.scala:263)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> at org.apache.spark.sql.Dataset.(Dataset.scala:187)
> at org.apache.spark.sql.Dataset.(Dataset.scala:168)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Method 2: Using HiveContext: 
> {noformat}
> >>> from pyspark.sql import HiveContext
> >>> sqlContext = HiveContext(sc)
> >>> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> DataFrame[]
> >>> sqlContext.sql("LOAD DATA LOCAL INPATH 
> >>> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/context.py", 
> line 346, in sql
> return self.sparkSession.sql(sqlQuery)
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py", 
> line 494, in sql
> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>   File 
> 

[jira] [Assigned] (SPARK-15417) Failed to enable HiveSupport in PySpark

2016-05-19 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reassigned SPARK-15417:
-

Assignee: Andrew Or

> Failed to enable HiveSupport in PySpark
> ---
>
> Key: SPARK-15417
> URL: https://issues.apache.org/jira/browse/SPARK-15417
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Andrew Or
>Priority: Blocker
>
> Unable to use Hive meta-store in pyspark shell. Tried both HiveContext and 
> SparkSession. Both failed. It always uses in-memory catalog.
> Method 1: Using SparkSession
> {noformat}
> >>> from pyspark.sql import SparkSession
> >>> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> >>> spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> DataFrame[]
> >>> spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' 
> >>> INTO TABLE src")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py", 
> line 494, in sql
> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 
> 57, in deco
> return f(*a, **kw)
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.
> : java.lang.UnsupportedOperationException: loadTable is not implemented
> at 
> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.loadTable(InMemoryCatalog.scala:297)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:280)
> at org.apache.spark.sql.execution.command.LoadData.run(tables.scala:263)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> at org.apache.spark.sql.Dataset.(Dataset.scala:187)
> at org.apache.spark.sql.Dataset.(Dataset.scala:168)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Method 2: Using HiveContext: 
> {noformat}
> >>> from pyspark.sql import HiveContext
> >>> sqlContext = HiveContext(sc)
> >>> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> DataFrame[]
> >>> sqlContext.sql("LOAD DATA LOCAL INPATH 
> >>> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/context.py", 
> line 346, in sql
> return self.sparkSession.sql(sqlQuery)
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py", 
> line 494, in sql
> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>   File 
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  

[jira] [Resolved] (SPARK-15392) The default value of size estimation is not good

2016-05-19 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15392.
---
  Resolution: Fixed
Target Version/s: 2.0.0

> The default value of size estimation is not good
> 
>
> Key: SPARK-15392
> URL: https://issues.apache.org/jira/browse/SPARK-15392
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> We use autoBroadcastJoinThreshold + 1L as the default value for size 
> estimation. That is not good in 2.0, because we now calculate the size based 
> on the size of the schema, so the estimate can fall below 
> autoBroadcastJoinThreshold if there is a SELECT on top of a DataFrame 
> created from an RDD.
> We should use an even bigger default value, for example, MaxLong.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15317) JobProgressListener takes a huge amount of memory with iterative DataFrame program in local, standalone

2016-05-19 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15317.
---
  Resolution: Fixed
Assignee: Shixiong Zhu
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> JobProgressListener takes a huge amount of memory with iterative DataFrame 
> program in local, standalone
> ---
>
> Key: SPARK-15317
> URL: https://issues.apache.org/jira/browse/SPARK-15317
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Spark 2.0, local mode + standalone mode on MacBook Pro 
> OSX 10.9
>Reporter: Joseph K. Bradley
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
> Attachments: cc_traces.txt, compare-1.6-10Kpartitions.png, 
> compare-2.0-10Kpartitions.png, compare-2.0-16partitions.png, 
> dump-standalone-2.0-1of4.png, dump-standalone-2.0-2of4.png, 
> dump-standalone-2.0-3of4.png, dump-standalone-2.0-4of4.png
>
>
> h2. TL;DR
> Running a small test locally, I found JobProgressListener consuming a huge 
> amount of memory.  There are many tasks being run, but it is still 
> surprising.  Summary, with details below:
> * Spark app: series of DataFrame joins
> * Issue: GC
> * Heap dump shows JobProgressListener taking 150 - 400MB, depending on the 
> Spark mode/version
> h2. Reproducing this issue
> h3.  With more complex code
> The code which fails:
> * Here is a branch with the code snippet which fails: 
> [https://github.com/jkbradley/spark/tree/18836174ab190d94800cc247f5519f3148822dce]
> ** This is based on Spark commit hash: 
> bb1362eb3b36b553dca246b95f59ba7fd8adcc8a
> * Look at {{CC.scala}}, which implements connected components using 
> DataFrames: 
> [https://github.com/jkbradley/spark/blob/18836174ab190d94800cc247f5519f3148822dce/mllib/src/main/scala/org/apache/spark/ml/CC.scala]
> In the spark shell, run:
> {code}
> import org.apache.spark.ml.CC
> import org.apache.spark.sql.SQLContext
> val sqlContext = SQLContext.getOrCreate(sc)
> CC.runTest(sqlContext)
> {code}
> I have attached a file {{cc_traces.txt}} with the stack traces from running 
> {{runTest}}.  Note that I sometimes had to run {{runTest}} twice to cause the 
> fatal exception.  This includes a trace for 1.6, which should run without 
> modifications to {{CC.scala}}.  These traces are from running in local mode.
> I used {{jmap}} to dump the heap:
> * local mode with 2.0: JobProgressListener took about 397 MB
> * standalone mode with 2.0: JobProgressListener took about 171 MB  (See 
> attached screenshots from MemoryAnalyzer)
> Both 1.6 and 2.0 exhibit this issue.  2.0 ran faster, and the issue 
> (JobProgressListener allocation) seems more severe with 2.0, though it could 
> just be that 2.0 makes more progress and runs more jobs.
> h3. With simpler code
> I ran this with master (~Spark 2.0):
> {code}
> val data = spark.range(0, 1, 1, 1)
> data.cache().count()
> {code}
> The resulting heap dump:
> * 78MB for {{scala.tools.nsc.interpreter.ILoop$ILoopInterpreter}}
> * 58MB for {{org.apache.spark.ui.jobs.JobProgressListener}}
> * 80MB for {{io.netty.buffer.PoolChunk}}
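
As a hedged side note (a mitigation, not a fix for the listener itself), the 
amount of UI state retained in memory can be capped with the existing 
spark.ui.retained* settings when running job-heavy iterative workloads:

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("ui-retention-sketch") \
    .config("spark.ui.retainedJobs", "200") \
    .config("spark.ui.retainedStages", "200") \
    .getOrCreate()

# Illustrative partition count only; many small tasks are what inflate the
# JobProgressListener bookkeeping described above.
data = spark.range(0, 100000, 1, 1000)
data.cache().count()
spark.stop()
{code}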



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15387) SessionCatalog in SimpleAnalyzer does not need to make database directory.

2016-05-19 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15387.
---
  Resolution: Fixed
Assignee: Kousuke Saruta
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> SessionCatalog in SimpleAnalyzer does not need to make database directory.
> --
>
> Key: SPARK-15387
> URL: https://issues.apache.org/jira/browse/SPARK-15387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 2.0.0
>
>
> After the fix for SPARK-15093, we are forced to create /user/hive/warehouse 
> whenever SimpleAnalyzer is used, but SimpleAnalyzer may not need the directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15300) Can't remove a block if it's under evicting

2016-05-19 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15300.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Can't remove a block if it's under evicting
> ---
>
> Key: SPARK-15300
> URL: https://issues.apache.org/jira/browse/SPARK-15300
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> {code}
> 16/04/15 12:17:05 INFO ContextCleaner: Cleaned shuffle 94
> 16/04/15 12:17:05 INFO ContextCleaner: Cleaned accumulator 1433121
> 16/04/15 12:17:05 INFO ContextCleaner: Cleaned accumulator 1433122
> 16/04/15 12:17:05 INFO ContextCleaner: Cleaned accumulator 1433123
> 16/04/15 12:17:05 INFO BlockManagerInfo: Removed broadcast_629_piece0 on 
> 10.0.164.43:39651 in memory (size: 23.4 KB, free: 15.8 GB)
> 16/04/15 12:17:05 ERROR BlockManagerSlaveEndpoint: Error in removing block 
> broadcast_631_piece0
> java.lang.IllegalStateException: Task -1024 has already locked 
> broadcast_631_piece0 for writing
>   at 
> org.apache.spark.storage.BlockInfoManager.lockForWriting(BlockInfoManager.scala:232)
>   at 
> org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1286)
>   at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$1.apply$mcZ$sp(BlockManagerSlaveEndpoint.scala:47)
>   at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$1.apply(BlockManagerSlaveEndpoint.scala:46)
>   at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$1.apply(BlockManagerSlaveEndpoint.scala:46)
>   at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:82)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/04/15 12:17:05 INFO BlockManagerInfo: Removed broadcast_626_piece0 on 
> 10.0.164.43:39651 in memory (size: 23.4 KB, free: 15.8 GB)
> 16/04/15 12:17:05 INFO ContextCleaner: Cleaned accumulator 1433124
> 16/04/15 12:17:05 INFO BlockManagerInfo: Removed broadcast_627_piece0 on 
> 10.0.164.43:39651 in memory (size: 23.3 KB, free: 15.8 GB)
> 16/04/15 12:17:05 INFO ContextCleaner: Cleaned accumulator 1433125
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15357) Cooperative spilling should check consumer memory mode

2016-05-16 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15357:
--
Description: 
In TaskMemoryManager.java:
{code}
for (MemoryConsumer c: consumers) {
  if (c != consumer && c.getUsed() > 0) {
    try {
      long released = c.spill(required - got, consumer);
      if (released > 0 && mode == tungstenMemoryMode) {
        got += memoryManager.acquireExecutionMemory(required - got, taskAttemptId, mode);
        if (got >= required) {
          break;
        }
      }
    } catch(...) { ... }
  }
}
{code}
Currently, when non-tungsten consumers acquire execution memory, they may force 
other tungsten consumers to spill and then NOT use the freed memory. A better 
way to do this is to incorporate the memory mode in the consumer itself and 
spill only those with matching memory modes.

  was:
In TaskMemoryManager.java:
{code}
for (MemoryConsumer c: consumers) {
  if (c != consumer && c.getUsed() > 0) {
    try {
      long released = c.spill(required - got, consumer);
      if (released > 0 && mode == tungstenMemoryMode) {
        logger.debug("Task {} released {} from {} for {}", taskAttemptId,
          Utils.bytesToString(released), c, consumer);
        got += memoryManager.acquireExecutionMemory(required - got, taskAttemptId, mode);
        if (got >= required) {
          break;
        }
      }
    } catch (IOException e) { ... }
  }
}
{code}
Currently, when non-tungsten consumers acquire execution memory, they may force 
other tungsten consumers to spill and then NOT use the freed memory. A better 
way to do this is to incorporate the memory mode in the consumer itself and 
spill only those with matching memory modes.


> Cooperative spilling should check consumer memory mode
> --
>
> Key: SPARK-15357
> URL: https://issues.apache.org/jira/browse/SPARK-15357
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>
> In TaskMemoryManager.java:
> {code}
> for (MemoryConsumer c: consumers) {
>   if (c != consumer && c.getUsed() > 0) {
>     try {
>       long released = c.spill(required - got, consumer);
>       if (released > 0 && mode == tungstenMemoryMode) {
>         got += memoryManager.acquireExecutionMemory(required - got, taskAttemptId, mode);
>         if (got >= required) {
>           break;
>         }
>       }
>     } catch(...) { ... }
>   }
> }
> {code}
> Currently, when non-tungsten consumers acquire execution memory, they may 
> force other tungsten consumers to spill and then NOT use the freed memory. A 
> better way to do this is to incorporate the memory mode in the consumer 
> itself and spill only those with matching memory modes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15357) Cooperative spilling should check consumer memory mode

2016-05-16 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15357:
-

 Summary: Cooperative spilling should check consumer memory mode
 Key: SPARK-15357
 URL: https://issues.apache.org/jira/browse/SPARK-15357
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Andrew Or


In TaskMemoryManager.java:
{code}
for (MemoryConsumer c: consumers) {
  if (c != consumer && c.getUsed() > 0) {
    try {
      long released = c.spill(required - got, consumer);
      if (released > 0 && mode == tungstenMemoryMode) {
        logger.debug("Task {} released {} from {} for {}", taskAttemptId,
          Utils.bytesToString(released), c, consumer);
        got += memoryManager.acquireExecutionMemory(required - got, taskAttemptId, mode);
        if (got >= required) {
          break;
        }
      }
    } catch (IOException e) { ... }
  }
}
{code}
Currently, when non-tungsten consumers acquire execution memory, they may force 
other tungsten consumers to spill and then NOT use the freed memory. A better 
way to do this is to incorporate the memory mode in the consumer itself and 
spill only those with matching memory modes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14684) Verification of partition specs in SessionCatalog

2016-05-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-14684.
---
  Resolution: Fixed
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Verification of partition specs in SessionCatalog
> -
>
> Key: SPARK-14684
> URL: https://issues.apache.org/jira/browse/SPARK-14684
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> When attempting to drop partitions of a table, if the user provides an 
> unknown column, Hive will drop all the partitions of the table, which is 
> likely not intended. E.g.
> {code}
> ALTER TABLE my_tab DROP PARTITION (ds='2008-04-09', unknownCol='12')
> {code}
> We should verify that the columns provided in the specs are actually 
> partitioned columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15277) Checking Partition Spec Existence Before Dropping

2016-05-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15277.
---
  Resolution: Fixed
Assignee: Xiao Li
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Checking Partition Spec Existence Before Dropping
> -
>
> Key: SPARK-15277
> URL: https://issues.apache.org/jira/browse/SPARK-15277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> Currently, we start dropping partitions before we finish checking the existence 
> of all the partition specs. If one partition spec does not exist, we just stop 
> processing the command, so some partitions have already been dropped while 
> others have not. We should check existence first, before dropping any partition.
> If any failure happens after we start to drop partitions, we should log an 
> error message indicating which partitions have been dropped and which have not.
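
A minimal sketch of the intended check-then-drop flow (the helper names and the println stand-in for logging are illustrative, not the actual command implementation):
{code}
// Sketch: validate every spec first, then drop; report progress if a later drop fails.
def dropPartitions(specs: Seq[Map[String, String]],
                   exists: Map[String, String] => Boolean,
                   drop: Map[String, String] => Unit): Unit = {
  val missing = specs.filterNot(exists)
  require(missing.isEmpty, s"No partition exists for spec(s): ${missing.mkString(", ")}")

  val dropped = scala.collection.mutable.Buffer.empty[Map[String, String]]
  try {
    specs.foreach { spec => drop(spec); dropped += spec }
  } catch {
    case e: Exception =>
      // Stand-in for logging: say which partitions were and were not dropped.
      println(s"Dropped: ${dropped.mkString(", ")}; not dropped: ${specs.diff(dropped).mkString(", ")}")
      throw e
  }
}
{code}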



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14684) Verification of partition specs in SessionCatalog

2016-05-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-14684:
--
Description: 
When attempting to drop partitions of a table, if the user provides an unknown 
column, Hive will drop all the partitions of the table, which is likely not 
intended. E.g.
{code}
ALTER TABLE my_tab DROP PARTITION (ds='2008-04-09', unknownCol='12')
{code}
We should verify that the columns provided in the specs are actually 
partitioned columns.
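
A minimal sketch of such a check (the method name and signature are illustrative, not the actual SessionCatalog API):
{code}
// Sketch: fail fast when a spec references a column that is not a partition column.
def requireValidSpec(spec: Map[String, String], partitionColumns: Seq[String]): Unit = {
  val unknown = spec.keys.filterNot(k => partitionColumns.exists(_.equalsIgnoreCase(k)))
  require(unknown.isEmpty,
    s"Non-partition column(s) in partition spec: ${unknown.mkString(", ")} " +
    s"(partition columns are: ${partitionColumns.mkString(", ")})")
}

// requireValidSpec(Map("ds" -> "2008-04-09", "unknownCol" -> "12"), Seq("ds"))
// now fails instead of letting Hive drop every partition.
{code}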

  was:When users input an invalid partition spec, we might not be able to catch 
it and issue an error message. Sometimes this can have disastrous results. For 
example, previously, when we altered a table and dropped a partition with an 
invalid spec, it could drop all the partitions due to a bug/defect in the Hive 
Metastore API. 


> Verification of partition specs in SessionCatalog
> -
>
> Key: SPARK-14684
> URL: https://issues.apache.org/jira/browse/SPARK-14684
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> When attempting to drop partitions of a table, if the user provides an 
> unknown column, Hive will drop all the partitions of the table, which is 
> likely not intended. E.g.
> {code}
> ALTER TABLE my_tab DROP PARTITION (ds='2008-04-09', unknownCol='12')
> {code}
> We should verify that the columns provided in the specs are actually 
> partitioned columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15289) SQL test compilation error from merge conflict

2016-05-12 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281678#comment-15281678
 ] 

Andrew Or commented on SPARK-15289:
---

Done, thanks for the ping.

> SQL test compilation error from merge conflict
> --
>
> Key: SPARK-15289
> URL: https://issues.apache.org/jira/browse/SPARK-15289
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.0.0
>Reporter: Piotr Milanowski
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Spark build fails during SQL build. Concerns commit 
> 6b69b8c0c778f4cba2b281fe3ad225dc922f82d6, but also earlier ones; build works 
> e.g. for commit c6d23b6604e85bcddbd1fb6a2c1c3edbfd2be2c1. 
> Run with command:
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> Result:
> {code}
> [error] 
> /home/bpol0421/various/spark/sql/core/src/test/scala/org/apache/spark/sql/internal/CatalogSuite.scala:282:
>  not found: value sparkSession
> [error] val dbString = CatalogImpl.makeDataset(Seq(db), 
> sparkSession).showString(10)
> [error] ^
> [error] 
> /home/bpol0421/various/spark/sql/core/src/test/scala/org/apache/spark/sql/internal/CatalogSuite.scala:283:
>  not found: value sparkSession
> [error] val tableString = CatalogImpl.makeDataset(Seq(table), 
> sparkSession).showString(10)
> [error]   ^
> [error] 
> /home/bpol0421/various/spark/sql/core/src/test/scala/org/apache/spark/sql/internal/CatalogSuite.scala:284:
>  not found: value sparkSession
> [error] val functionString = CatalogImpl.makeDataset(Seq(function), 
> sparkSession).showString(10)
> [error] ^
> [error] 
> /home/bpol0421/various/spark/sql/core/src/test/scala/org/apache/spark/sql/internal/CatalogSuite.scala:285:
>  not found: value sparkSession
> [error] val columnString = CatalogImpl.makeDataset(Seq(column), 
> sparkSession).showString(10)
> [error] ^
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15289) SQL test compilation error from merge conflict

2016-05-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15289.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> SQL test compilation error from merge conflict
> --
>
> Key: SPARK-15289
> URL: https://issues.apache.org/jira/browse/SPARK-15289
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.0.0
>Reporter: Piotr Milanowski
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Spark build fails during SQL build. Concerns commit 
> 6b69b8c0c778f4cba2b281fe3ad225dc922f82d6, but also earlier ones; build works 
> e.g. for commit c6d23b6604e85bcddbd1fb6a2c1c3edbfd2be2c1. 
> Run with command:
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> Result:
> {code}
> [error] 
> /home/bpol0421/various/spark/sql/core/src/test/scala/org/apache/spark/sql/internal/CatalogSuite.scala:282:
>  not found: value sparkSession
> [error] val dbString = CatalogImpl.makeDataset(Seq(db), 
> sparkSession).showString(10)
> [error] ^
> [error] 
> /home/bpol0421/various/spark/sql/core/src/test/scala/org/apache/spark/sql/internal/CatalogSuite.scala:283:
>  not found: value sparkSession
> [error] val tableString = CatalogImpl.makeDataset(Seq(table), 
> sparkSession).showString(10)
> [error]   ^
> [error] 
> /home/bpol0421/various/spark/sql/core/src/test/scala/org/apache/spark/sql/internal/CatalogSuite.scala:284:
>  not found: value sparkSession
> [error] val functionString = CatalogImpl.makeDataset(Seq(function), 
> sparkSession).showString(10)
> [error] ^
> [error] 
> /home/bpol0421/various/spark/sql/core/src/test/scala/org/apache/spark/sql/internal/CatalogSuite.scala:285:
>  not found: value sparkSession
> [error] val columnString = CatalogImpl.makeDataset(Seq(column), 
> sparkSession).showString(10)
> [error] ^
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15264) Spark 2.0 CSV Reader: NPE on Blank Column Names

2016-05-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15264.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Spark 2.0 CSV Reader: NPE on Blank Column Names
> ---
>
> Key: SPARK-15264
> URL: https://issues.apache.org/jira/browse/SPARK-15264
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Bill Chambers
>Assignee: Bill Chambers
> Fix For: 2.0.0
>
>
> When you read in a csv file that starts with blank column names the read 
> fails when you specify that you want a header.
> Pull request coming shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15274) CSV default column names should be consistent

2016-05-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15274.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> CSV default column names should be consistent
> -
>
> Key: SPARK-15274
> URL: https://issues.apache.org/jira/browse/SPARK-15274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Bill Chambers
> Fix For: 2.0.0
>
>
> If a column name is not provided, Spark SQL usually uses the convention 
> "_c0", "_c1" etc., but when reading in CSV files without headers, we use "C0" 
> and "C1". This is inconsistent and we should fix it by Spark 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15276) CREATE TABLE with LOCATION should imply EXTERNAL

2016-05-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15276.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> CREATE TABLE with LOCATION should imply EXTERNAL
> 
>
> Key: SPARK-15276
> URL: https://issues.apache.org/jira/browse/SPARK-15276
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> If the user runs `CREATE TABLE some_table ... LOCATION /some/path`, then this 
> will still be a managed table even though the table's data is stored at 
> /some/path. The problem is that when we drop the table, we'll also delete the 
> data at /some/path. This could cause problems if /some/path contains existing 
> data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15279) Disallow ROW FORMAT and STORED AS (parquet | orc | avro etc.)

2016-05-11 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15279:
-

 Summary: Disallow ROW FORMAT and STORED AS (parquet | orc | avro 
etc.)
 Key: SPARK-15279
 URL: https://issues.apache.org/jira/browse/SPARK-15279
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or


They are both potentially conflicting ways that allow you to specify the SerDe. 
Unfortunately, we can't just get rid of ROW FORMAT because it may be used with 
TEXTFILE or RCFILE. For other file formats, we should fail fast wherever 
possible.
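
For example, a statement like the following mixes both mechanisms and should be rejected early (a hypothetical spark-shell snippet; the table name and SerDe shown are just one example):
{code}
// ROW FORMAT picks a SerDe explicitly, while STORED AS PARQUET already implies one,
// so this combination should fail fast rather than silently pick a winner.
spark.sql("""
  CREATE TABLE conflicting_tbl (key INT, value STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  STORED AS PARQUET
""")
{code}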



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15275) CatalogTable should store sort ordering for sorted columns

2016-05-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15275:
--
Summary: CatalogTable should store sort ordering for sorted columns  (was: 
[SQL] CatalogTable should store sort ordering for sorted columns)

> CatalogTable should store sort ordering for sorted columns
> --
>
> Key: SPARK-15275
> URL: https://issues.apache.org/jira/browse/SPARK-15275
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Priority: Trivial
>
> For bucketed tables in Hive, one can also add constraint about column 
> sortedness along with ordering.
> As per the spec in [0], CREATE TABLE statement can allow SORT ordering as 
> well:
>   [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], 
> ...)] INTO num_buckets BUCKETS]
> See [1] for example. 
> [0] : 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
> [1] : 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-BucketedSortedTables
> Currently CatalogTable does not store any information about the sort ordering 
> and just has the names of the sorted columns.
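
For example, a bucketed and sorted table along those lines might look like this (a hedged spark-shell sketch; table and column names are made up):
{code}
// Hive-style DDL with an explicit sort direction; the proposal is for CatalogTable
// to preserve the ASC/DESC information, not just the names of the sorted columns.
spark.sql("""
  CREATE TABLE page_views (user_id BIGINT, ts BIGINT, url STRING)
  CLUSTERED BY (user_id) SORTED BY (ts DESC) INTO 32 BUCKETS
""")
{code}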



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15275) CatalogTable should store sort ordering for sorted columns

2016-05-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15275:
--
Priority: Major  (was: Trivial)

> CatalogTable should store sort ordering for sorted columns
> --
>
> Key: SPARK-15275
> URL: https://issues.apache.org/jira/browse/SPARK-15275
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>
> For bucketed tables in Hive, one can also add constraint about column 
> sortedness along with ordering.
> As per the spec in [0], CREATE TABLE statement can allow SORT ordering as 
> well:
>   [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], 
> ...)] INTO num_buckets BUCKETS]
> See [1] for example. 
> [0] : 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
> [1] : 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-BucketedSortedTables
> Currently CatalogTable does not store any information about the sort ordering 
> and just has the names of the sorted columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15275) CatalogTable should store sort ordering for sorted columns

2016-05-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15275:
--
Assignee: Tejas Patil

> CatalogTable should store sort ordering for sorted columns
> --
>
> Key: SPARK-15275
> URL: https://issues.apache.org/jira/browse/SPARK-15275
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Trivial
>
> For bucketed tables in Hive, one can also add constraint about column 
> sortedness along with ordering.
> As per the spec in [0], CREATE TABLE statement can allow SORT ordering as 
> well:
>   [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], 
> ...)] INTO num_buckets BUCKETS]
> See [1] for example. 
> [0] : 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
> [1] : 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-BucketedSortedTables
> Currently CatalogTable does not store any information about the sort ordering 
> and just has the names of the sorted columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15275) CatalogTable should store sort ordering for sorted columns

2016-05-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15275:
--
Affects Version/s: (was: 1.6.1)
   2.0.0

> CatalogTable should store sort ordering for sorted columns
> --
>
> Key: SPARK-15275
> URL: https://issues.apache.org/jira/browse/SPARK-15275
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>
> For bucketed tables in Hive, one can also add constraint about column 
> sortedness along with ordering.
> As per the spec in [0], CREATE TABLE statement can allow SORT ordering as 
> well:
>   [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], 
> ...)] INTO num_buckets BUCKETS]
> See [1] for example. 
> [0] : 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
> [1] : 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-BucketedSortedTables
> Currently CatalogTable does not store any information about the sort ordering 
> and just has the names of the sorted columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15276) CREATE TABLE with LOCATION should imply EXTERNAL

2016-05-11 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15276:
-

 Summary: CREATE TABLE with LOCATION should imply EXTERNAL
 Key: SPARK-15276
 URL: https://issues.apache.org/jira/browse/SPARK-15276
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or


If the user runs `CREATE TABLE some_table ... LOCATION /some/path`, then this 
will still be a managed table even though the table's data is stored at 
/some/path. The problem is that when we drop the table, we'll also delete the 
data at /some/path. This could cause problems if /some/path contains existing data.
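
A hedged spark-shell illustration of the behaviour described above (the table name is made up):
{code}
// With the current semantics this still creates a MANAGED table,
// even though its data lives at /some/path:
spark.sql("CREATE TABLE some_table (key INT, value STRING) LOCATION '/some/path'")

// ...so dropping it also deletes whatever already existed under /some/path:
spark.sql("DROP TABLE some_table")

// The proposal: LOCATION should imply EXTERNAL, so DROP TABLE leaves /some/path alone.
{code}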



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15264) Spark 2.0 CSV Reader: Error on Blank Column Names

2016-05-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15264:
--
Assignee: Bill Chambers

> Spark 2.0 CSV Reader: Error on Blank Column Names
> -
>
> Key: SPARK-15264
> URL: https://issues.apache.org/jira/browse/SPARK-15264
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Bill Chambers
>Assignee: Bill Chambers
>
> When you read in a csv file that starts with blank column names the read 
> fails when you specify that you want a header.
> Pull request coming shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15264) Spark 2.0 CSV Reader: NPE on Blank Column Names

2016-05-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15264:
--
Summary: Spark 2.0 CSV Reader: NPE on Blank Column Names  (was: Spark 2.0 
CSV Reader: Error on Blank Column Names)

> Spark 2.0 CSV Reader: NPE on Blank Column Names
> ---
>
> Key: SPARK-15264
> URL: https://issues.apache.org/jira/browse/SPARK-15264
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Bill Chambers
>Assignee: Bill Chambers
>
> When you read in a csv file that starts with blank column names the read 
> fails when you specify that you want a header.
> Pull request coming shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15264) Spark 2.0 CSV Reader: Error on Blank Column Names

2016-05-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15264:
--
Target Version/s: 2.0.0

> Spark 2.0 CSV Reader: Error on Blank Column Names
> -
>
> Key: SPARK-15264
> URL: https://issues.apache.org/jira/browse/SPARK-15264
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Bill Chambers
>Assignee: Bill Chambers
>
> When you read in a csv file that starts with blank column names the read 
> fails when you specify that you want a header.
> Pull request coming shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15274) CSV default column names should be consistent

2016-05-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15274:
--
Assignee: Bill Chambers

> CSV default column names should be consistent
> -
>
> Key: SPARK-15274
> URL: https://issues.apache.org/jira/browse/SPARK-15274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Bill Chambers
>
> If a column name is not provided, Spark SQL usually uses the convention 
> "_c0", "_c1" etc., but when reading in CSV files without headers, we use "C0" 
> and "C1". This is inconsistent and we should fix it by Spark 2.0.
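
A small spark-shell sketch of the inconsistency (the file path is made up):
{code}
// Headerless CSV: column names currently come back as "C0", "C1", ...
val df = spark.read.csv("/tmp/no_header.csv")   // made-up path
df.printSchema()
// root
//  |-- C0: string (nullable = true)
//  |-- C1: string (nullable = true)
// The proposal is to follow the "_c0", "_c1" convention used elsewhere in Spark SQL.
{code}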



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15274) CSV default column names should be consistent

2016-05-11 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15274:
-

 Summary: CSV default column names should be consistent
 Key: SPARK-15274
 URL: https://issues.apache.org/jira/browse/SPARK-15274
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Or


If a column name is not provided, Spark SQL usually uses the convention "_c0", 
"_c1" etc., but when reading in CSV files without headers, we use "C0" and 
"C1". This is inconsistent and we should fix it by Spark 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13566) Deadlock between MemoryStore and BlockManager

2016-05-11 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280615#comment-15280615
 ] 

Andrew Or commented on SPARK-13566:
---

[~ekeddy] This only happens with the unified memory manager, so you could 
switch back to the static memory manager by setting 
`spark.memory.useLegacyMode` to true. You may observe a decrease in performance 
if you do that, however.
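
For reference, a minimal sketch of setting that in application code (the app name is arbitrary):
{code}
// Fall back to the static (legacy) memory manager to avoid the deadlock,
// at a possible cost in performance, as noted above.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("legacy-memory-example")          // arbitrary app name for illustration
  .set("spark.memory.useLegacyMode", "true")
val sc = new SparkContext(conf)
{code}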

> Deadlock between MemoryStore and BlockManager
> -
>
> Key: SPARK-13566
> URL: https://issues.apache.org/jira/browse/SPARK-13566
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.0
> Environment: Spark 1.6.0 hadoop2.2.0 jdk1.8.0_65 centOs 6.2
>Reporter: cen yuhai
>Assignee: cen yuhai
> Fix For: 1.6.2
>
>
> ===
> "block-manager-slave-async-thread-pool-1":
> at org.apache.spark.storage.MemoryStore.remove(MemoryStore.scala:216)
> - waiting to lock <0x0005895b09b0> (a 
> org.apache.spark.memory.UnifiedMemoryManager)
> at 
> org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1114)
> - locked <0x00058ed6aae0> (a org.apache.spark.storage.BlockInfo)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1101)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1101)
> at scala.collection.immutable.Set$Set2.foreach(Set.scala:94)
> at 
> org.apache.spark.storage.BlockManager.removeBroadcast(BlockManager.scala:1101)
> at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply$mcI$sp(BlockManagerSlaveEndpoint.scala:65)
> at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:65)
> at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:65)
> at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:84)
> at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> "Executor task launch worker-10":
> at 
> org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1032)
> - waiting to lock <0x00059a0988b8> (a 
> org.apache.spark.storage.BlockInfo)
> at 
> org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1009)
> at 
> org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:460)
> at 
> org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:449)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15262) race condition in killing an executor and reregistering an executor

2016-05-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15262:
--
Target Version/s: 1.6.2, 2.0.0

> race condition in killing an executor and reregistering an executor
> ---
>
> Key: SPARK-15262
> URL: https://issues.apache.org/jira/browse/SPARK-15262
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Shixiong Zhu
>
> There is a race condition when killing an executor and reregistering an 
> executor happen at the same time. Here are the steps to reproduce it:
> 1. master finds a worker is dead
> 2. master tells the driver to remove the executor
> 3. driver removes the executor
> 4. BlockManagerMasterEndpoint removes the block manager
> 5. executor finds it's not registered via heartbeat
> 6. executor sends a reregister-block-manager request
> 7. the block manager is registered again
> 8. executor is killed by the worker
> 9. CoarseGrainedSchedulerBackend ignores onDisconnected as this address is 
> not in the executor list
> 10. BlockManagerMasterEndpoint.blockManagerInfo contains dead block managers
> As BlockManagerMasterEndpoint.blockManagerInfo contains some dead block 
> managers, when we unpersist an RDD, remove a broadcast, or clean a shuffle 
> block via an RPC endpoint of a dead block manager, we will get a 
> ClosedChannelException.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15249) Use FunctionResource instead of (String, String) in CreateFunction and CatalogFunction for resource

2016-05-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15249.
---
  Resolution: Fixed
Assignee: Sandeep Singh
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Use FunctionResource instead of (String, String) in CreateFunction and 
> CatalogFunction for resource
> ---
>
> Key: SPARK-15249
> URL: https://issues.apache.org/jira/browse/SPARK-15249
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Minor
> Fix For: 2.0.0
>
>
> Use FunctionResource instead of (String, String) in CreateFunction and 
> CatalogFunction for resource
> see: TODO's here
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L36
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala#L42



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15257) Require CREATE EXTERNAL TABLE to specify LOCATION

2016-05-10 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15257:
-

 Summary: Require CREATE EXTERNAL TABLE to specify LOCATION
 Key: SPARK-15257
 URL: https://issues.apache.org/jira/browse/SPARK-15257
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or


Right now, when the user runs `CREATE EXTERNAL TABLE` without specifying 
`LOCATION`, the table will still be created in the warehouse directory, but its 
data won't be deleted even when the user drops the table! This is a problem. We 
should require the user to also specify `LOCATION`.

Note: This does not apply to `CREATE EXTERNAL TABLE ... USING`, which is not yet 
supported.
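
For illustration (the table name and path are made up; the exact error message is up to the implementation):
{code}
// Today this succeeds and the table's files land in the warehouse directory:
spark.sql("CREATE EXTERNAL TABLE ext_table (key INT, value STRING)")

// The proposal is to reject the statement above and require an explicit location:
spark.sql("CREATE EXTERNAL TABLE ext_table (key INT, value STRING) LOCATION '/data/ext_table'")
{code}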



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14857) Table/Database Name Validation in SessionCatalog

2016-05-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-14857:
--
Assignee: Xiao Li

> Table/Database Name Validation in SessionCatalog
> 
>
> Key: SPARK-14857
> URL: https://issues.apache.org/jira/browse/SPARK-14857
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> We need to validate database/table names before storing this information in 
> `ExternalCatalog`. 
> For example, if users use backticks to quote table/database names 
> containing illegal characters, these names are allowed by the Spark parser, but 
> the Hive metastore does not allow them. We need to catch them in SessionCatalog 
> and issue an appropriate error message.
> ```
> CREATE TABLE `tab:1`  ...
> ```
> This PR enforces the name rule of Spark SQL for `table`/`database`/`view`: 
> names `can only contain alphanumeric and underscore characters.` Unlike 
> Hive, we allow names that start with an underscore. 
> The validation of function/column names will be done in a separate JIRA.
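
A minimal sketch of such a validation rule (the regex and method name below are illustrative, not the actual SessionCatalog code):
{code}
// Accept only alphanumeric and underscore characters (leading underscores allowed).
val validNameFormat = "^[a-zA-Z0-9_]+$".r

def validateName(name: String): Unit = {
  require(validNameFormat.pattern.matcher(name).matches(),
    s"`$name` is not a valid name: only alphanumeric and underscore characters are allowed.")
}

// validateName("my_table")  -> ok
// validateName("tab:1")     -> fails fast instead of being rejected later by the Hive metastore
{code}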



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14603) SessionCatalog needs to check if a metadata operation is valid

2016-05-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-14603.
---
   Resolution: Fixed
 Assignee: Xiao Li
Fix Version/s: 2.0.0

> SessionCatalog needs to check if a metadata operation is valid
> --
>
> Key: SPARK-14603
> URL: https://issues.apache.org/jira/browse/SPARK-14603
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> Since we cannot really trust that the underlying external catalog will throw 
> exceptions when there is an invalid metadata operation, let's do the validation 
> in SessionCatalog. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14684) Verification of partition specs in SessionCatalog

2016-05-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-14684:
--
Assignee: Xiao Li

> Verification of partition specs in SessionCatalog
> -
>
> Key: SPARK-14684
> URL: https://issues.apache.org/jira/browse/SPARK-14684
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> When users input an invalid partition spec, we might not be able to catch it 
> and issue an error message. Sometimes this can have disastrous results. 
> For example, previously, when we altered a table and dropped a partition with 
> an invalid spec, it could drop all the partitions due to a bug/defect in the 
> Hive Metastore API. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15037) Use SparkSession instead of SQLContext in testsuites

2016-05-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15037:
--
Component/s: SQL

> Use SparkSession instead of SQLContext in testsuites
> 
>
> Key: SPARK-15037
> URL: https://issues.apache.org/jira/browse/SPARK-15037
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Dongjoon Hyun
>Assignee: Sandeep Singh
> Fix For: 2.0.0
>
>
> This issue aims to update the existing testsuites to use `SparkSession` 
> instead of `SQLContext` since `SQLContext` exists just for backward 
> compatibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15037) Use SparkSession instead of SQLContext in testsuites

2016-05-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15037:
--
Component/s: Tests

> Use SparkSession instead of SQLContext in testsuites
> 
>
> Key: SPARK-15037
> URL: https://issues.apache.org/jira/browse/SPARK-15037
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Dongjoon Hyun
>Assignee: Sandeep Singh
> Fix For: 2.0.0
>
>
> This issue aims to update the existing testsuites to use `SparkSession` 
> instead of `SQLContext` since `SQLContext` exists just for backward 
> compatibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15037) Use SparkSession instead of SQLContext in testsuites

2016-05-10 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15037.
---
  Resolution: Fixed
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Use SparkSession instead of SQLContext in testsuites
> 
>
> Key: SPARK-15037
> URL: https://issues.apache.org/jira/browse/SPARK-15037
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Dongjoon Hyun
>Assignee: Sandeep Singh
> Fix For: 2.0.0
>
>
> This issue aims to update the existing testsuites to use `SparkSession` 
> instead of `SQLContext` since `SQLContext` exists just for backward 
> compatibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15236) No way to disable Hive support in REPL

2016-05-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15236:
--
Component/s: Spark Shell

> No way to disable Hive support in REPL
> --
>
> Key: SPARK-15236
> URL: https://issues.apache.org/jira/browse/SPARK-15236
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>
> If you built Spark with Hive classes, there's no switch to flip to start a 
> new `spark-shell` using the InMemoryCatalog. The only thing you can do now is 
> to rebuild Spark again. That is quite inconvenient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15236) No way to disable Hive support in REPL

2016-05-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15236:
--
Assignee: (was: Andrew Or)

> No way to disable Hive support in REPL
> --
>
> Key: SPARK-15236
> URL: https://issues.apache.org/jira/browse/SPARK-15236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>
> If you built Spark with Hive classes, there's no switch to flip to start a 
> new `spark-shell` using the InMemoryCatalog. The only thing you can do now is 
> to rebuild Spark again. That is quite inconvenient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15236) No way to disable Hive support in REPL

2016-05-09 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15236:
-

 Summary: No way to disable Hive support in REPL
 Key: SPARK-15236
 URL: https://issues.apache.org/jira/browse/SPARK-15236
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or


If you built Spark with Hive classes, there's no switch to flip to start a new 
`spark-shell` using the InMemoryCatalog. The only thing you can do now is to 
rebuild Spark again. That is quite inconvenient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly

2016-05-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reassigned SPARK-15234:
-

Assignee: Andrew Or

> spark.catalog.listDatabases.show() is not formatted correctly
> -
>
> Key: SPARK-15234
> URL: https://issues.apache.org/jira/browse/SPARK-15234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> {code}
> scala> spark.catalog.listDatabases.show()
> ++---+---+
> |name|description|locationUri|
> ++---+---+
> |Database[name='de...|
> |Database[name='my...|
> |Database[name='so...|
> ++---+---+
> {code}
> It's because org.apache.spark.sql.catalog.Database is not a case class!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly

2016-05-09 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15234:
-

 Summary: spark.catalog.listDatabases.show() is not formatted 
correctly
 Key: SPARK-15234
 URL: https://issues.apache.org/jira/browse/SPARK-15234
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Or


{code}
scala> spark.catalog.listDatabases.show()
++---+---+
|name|description|locationUri|
++---+---+
|Database[name='de...|
|Database[name='my...|
|Database[name='so...|
++---+---+
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly

2016-05-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15234:
--
Description: 
{code}
scala> spark.catalog.listDatabases.show()
++---+---+
|name|description|locationUri|
++---+---+
|Database[name='de...|
|Database[name='my...|
|Database[name='so...|
++---+---+
{code}

It's because org.apache.spark.sql.catalog.Database is not a case class!
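
A small spark-shell sketch of why the case class matters (DatabaseInfo below is a made-up stand-in for org.apache.spark.sql.catalog.Database, and the path is illustrative):
{code}
import spark.implicits._

// A case class gives the Dataset encoder real fields, so show() renders one column per field.
case class DatabaseInfo(name: String, description: String, locationUri: String)

spark.createDataset(Seq(
  DatabaseInfo("default", "default database", "file:/user/hive/warehouse")
)).show()
// +-------+----------------+--------------------+
// |   name|     description|         locationUri|
// +-------+----------------+--------------------+
{code}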

  was:
{code}
scala> spark.catalog.listDatabases.show()
++---+---+
|name|description|locationUri|
++---+---+
|Database[name='de...|
|Database[name='my...|
|Database[name='so...|
++---+---+
{code}


> spark.catalog.listDatabases.show() is not formatted correctly
> -
>
> Key: SPARK-15234
> URL: https://issues.apache.org/jira/browse/SPARK-15234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>
> {code}
> scala> spark.catalog.listDatabases.show()
> ++---+---+
> |name|description|locationUri|
> ++---+---+
> |Database[name='de...|
> |Database[name='my...|
> |Database[name='so...|
> ++---+---+
> {code}
> It's because org.apache.spark.sql.catalog.Database is not a case class!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14021) Support custom context derived from HiveContext for SparkSQLEnv

2016-05-09 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276820#comment-15276820
 ] 

Andrew Or commented on SPARK-14021:
---

Closing as Won't Fix because the issue is outdated after HiveContext was 
removed.

> Support custom context derived from HiveContext for SparkSQLEnv
> ---
>
> Key: SPARK-14021
> URL: https://issues.apache.org/jira/browse/SPARK-14021
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Adrian Wang
>
> This is to create a custom context for the commands bin/spark-sql and 
> sbin/start-thriftserver. Any context that is derived from HiveContext is 
> acceptable. Users need to configure the class name of the custom context in the 
> config spark.sql.context.class, and make sure the class is on the classpath. 
> This is to provide a more elegant way for infrastructure teams to make custom 
> configurations and changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14021) Support custom context derived from HiveContext for SparkSQLEnv

2016-05-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-14021.
---
Resolution: Won't Fix

> Support custom context derived from HiveContext for SparkSQLEnv
> ---
>
> Key: SPARK-14021
> URL: https://issues.apache.org/jira/browse/SPARK-14021
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Adrian Wang
>
> This is to create a custom context for the commands bin/spark-sql and 
> sbin/start-thriftserver. Any context that is derived from HiveContext is 
> acceptable. Users need to configure the class name of the custom context in the 
> config spark.sql.context.class, and make sure the class is on the classpath. 
> This is to provide a more elegant way for infrastructure teams to make custom 
> configurations and changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10653) Remove unnecessary things from SparkEnv

2016-05-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10653.
---
  Resolution: Fixed
Assignee: Alex Bozarth
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Remove unnecessary things from SparkEnv
> ---
>
> Key: SPARK-10653
> URL: https://issues.apache.org/jira/browse/SPARK-10653
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Alex Bozarth
> Fix For: 2.0.0
>
>
> As of the writing of this message, there are at least two things that can be 
> removed from it:
> {code}
> @DeveloperApi
> class SparkEnv (
> val executorId: String,
> private[spark] val rpcEnv: RpcEnv,
> val serializer: Serializer,
> val closureSerializer: Serializer,
> val cacheManager: CacheManager,
> val mapOutputTracker: MapOutputTracker,
> val shuffleManager: ShuffleManager,
> val broadcastManager: BroadcastManager,
> val blockTransferService: BlockTransferService, // this one can go
> val blockManager: BlockManager,
> val securityManager: SecurityManager,
> val httpFileServer: HttpFileServer,
> val sparkFilesDir: String, // this one maybe? It's only used in 1 place.
> val metricsSystem: MetricsSystem,
> val shuffleMemoryManager: ShuffleMemoryManager,
> val executorMemoryManager: ExecutorMemoryManager, // this can go
> val outputCommitCoordinator: OutputCommitCoordinator,
> val conf: SparkConf) extends Logging {
>   ...
> }
> {code}
> We should avoid adding to this infinite list of things in SparkEnv's 
> constructors if they're not needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15210) Add missing @DeveloperApi annotation in sql.types

2016-05-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15210.
---
  Resolution: Fixed
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Add missing @DeveloperApi annotation in sql.types
> -
>
> Key: SPARK-15210
> URL: https://issues.apache.org/jira/browse/SPARK-15210
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
> Fix For: 2.0.0
>
>
> The @DeveloperApi annotation is missing for {{AbstractDataType}}, {{MapType}} 
> and {{UserDefinedType}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15166) Move hive-specific conf setting from SparkSession

2016-05-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15166.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Move hive-specific conf setting from SparkSession
> -
>
> Key: SPARK-15166
> URL: https://issues.apache.org/jira/browse/SPARK-15166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15210) Add missing @DeveloperApi annotation in sql.types

2016-05-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15210:
--
Assignee: zhengruifeng

> Add missing @DeveloperApi annotation in sql.types
> -
>
> Key: SPARK-15210
> URL: https://issues.apache.org/jira/browse/SPARK-15210
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
> Fix For: 2.0.0
>
>
> The @DeveloperApi annotation is missing for {{AbstractDataType}}, {{MapType}} 
> and {{UserDefinedType}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15220) Add hyperlink to "running application" and "completed application"

2016-05-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15220.
---
  Resolution: Fixed
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Add hyperlink to "running application" and "completed application"
> --
>
> Key: SPARK-15220
> URL: https://issues.apache.org/jira/browse/SPARK-15220
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Mao, Wei
>Priority: Minor
> Fix For: 2.0.0
>
>
> Add hyperlinks to "running application" and "completed application" so users 
> can jump to the application table directly. In my environment, I set up 1000+ 
> workers and it's painful to scroll down past the worker list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15067) YARN executors are launched with fixed perm gen size

2016-05-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-15067.
---
  Resolution: Fixed
Assignee: Sean Owen
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> YARN executors are launched with fixed perm gen size
> 
>
> Key: SPARK-15067
> URL: https://issues.apache.org/jira/browse/SPARK-15067
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Renato Falchi Brandão
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.0.0
>
>
> It is impossible to change the executors' max perm gen size using the property 
> "spark.executor.extraJavaOptions" when running on YARN.
> When the JVM option "-XX:MaxPermSize" is set through that property, Spark puts 
> it properly in the shell command that will start the JVM container but, at the 
> end of the command, it sets this option again with a fixed value of 256m, as 
> you can see in the log I've extracted:
> 2016-04-30 17:20:12 INFO  ExecutorRunnable:58 -
> ===
> YARN executor launch context:
>   env:
> CLASSPATH -> 
> {{PWD}}{{PWD}}/__spark__.jar$HADOOP_CONF_DIR/usr/hdp/current/hadoop-client/*/usr/hdp/current/hadoop-client/lib/*/usr/hdp/current/hadoop-hdfs-client/*/usr/hdp/current/hadoop-hdfs-client/lib/*/usr/hdp/current/hadoop-yarn-client/*/usr/hdp/current/hadoop-yarn-client/lib/*/usr/hdp/mr-framework/hadoop/share/hadoop/mapreduce/*:/usr/hdp/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:/usr/hdp/mr-framework/hadoop/share/hadoop/common/*:/usr/hdp/mr-framework/hadoop/share/hadoop/common/lib/*:/usr/hdp/mr-framework/hadoop/share/hadoop/yarn/*:/usr/hdp/mr-framework/hadoop/share/hadoop/yarn/lib/*:/usr/hdp/mr-framework/hadoop/share/hadoop/hdfs/*:/usr/hdp/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/current/hadoop/lib/hadoop-lzo-0.6.0.jar:/etc/hadoop/conf/secure
> SPARK_LOG_URL_STDERR -> 
> http://x0668sl.x.br:8042/node/containerlogs/container_1456962126505_329993_01_02/h_loadbd/stderr?start=-4096
> SPARK_YARN_STAGING_DIR -> .sparkStaging/application_1456962126505_329993
> SPARK_YARN_CACHE_FILES_FILE_SIZES -> 191719054,166
> SPARK_USER -> h_loadbd
> SPARK_YARN_CACHE_FILES_VISIBILITIES -> PUBLIC,PUBLIC
> SPARK_YARN_MODE -> true
> SPARK_YARN_CACHE_FILES_TIME_STAMPS -> 1459806496093,1459808508343
> SPARK_LOG_URL_STDOUT -> 
> http://x0668sl.x.br:8042/node/containerlogs/container_1456962126505_329993_01_02/h_loadbd/stdout?start=-4096
> SPARK_YARN_CACHE_FILES -> 
> hdfs://x/user/datalab/hdp/spark/lib/spark-assembly-1.6.0.2.3.4.1-10-hadoop2.7.1.2.3.4.1-10.jar#__spark__.jar,hdfs://tlvcluster/user/datalab/hdp/spark/conf/hive-site.xml#hive-site.xml
>   command:
> {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms6144m 
> -Xmx6144m '-XX:+PrintGCDetails' '-XX:MaxPermSize=1024M' 
> '-XX:+PrintGCTimeStamps' -Djava.io.tmpdir={{PWD}}/tmp 
> '-Dspark.akka.timeout=30' '-Dspark.driver.port=62875' 
> '-Dspark.rpc.askTimeout=30' '-Dspark.rpc.lookupTimeout=30' 
> -Dspark.yarn.app.container.log.dir= -XX:MaxPermSize=256m 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@10.125.81.42:62875 --executor-id 1 --hostname 
> x0668sl.x.br --cores 1 --app-id application_1456962126505_329993 
> --user-class-path file:$PWD/__app__.jar 1> /stdout 2> 
> /stderr
> Analyzing the code, it is possible to see that all the options set in the 
> property "spark.executor.extraJavaOptions" are enclosed, one by one, in 
> single quotes (ExecutorRunnable.scala:151) before the launcher decides 
> whether a default value has to be provided for the option 
> "-XX:MaxPermSize" (ExecutorRunnable.scala:202).
> This decision is made by examining all the options set and looking for a string 
> starting with the value "-XX:MaxPermSize" (CommandBuilderUtils.java:328). If 
> that value is not found, the default value is set.
> Because every option is wrapped in single quotes, an option starting with 
> "-XX:MaxPermSize" will never be found, so the default value will always be 
> provided.
> A possible solution is to change the source code of CommandBuilderUtils.java at 
> line 328:
> From-> if (arg.startsWith("-XX:MaxPermSize="))
> To-> if (arg.indexOf("-XX:MaxPermSize=") > -1)
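
A quick illustration of why the quoted option never matches (plain Scala, not the actual launcher code):
{code}
val arg = "'-XX:MaxPermSize=1024M'"        // extraJavaOptions arrive wrapped in single quotes

arg.startsWith("-XX:MaxPermSize=")         // false -> the default 256m is appended
arg.indexOf("-XX:MaxPermSize=") > -1       // true  -> the user's value would be detected
{code}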



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


