[jira] [Updated] (SPARK-19881) Support Dynamic Partition Inserts params with SET command

2017-03-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-19881:
--
Description: 
Since Spark 2.0.0, `SET` commands do not pass the values to HiveClient. In most 
cases, Spark handles this well. However, for dynamic partition inserts, users run 
into the following misleading situation. 

{code}
scala> spark.range(1001).selectExpr("id as key", "id as value").registerTempTable("t1001")

scala> sql("create table p (value int) partitioned by (key int)").show

scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.spark.SparkException:
Dynamic partition strict mode requires at least one static partition column.
To turn this off set hive.exec.dynamic.partition.mode=nonstrict

scala> sql("set hive.exec.dynamic.partition.mode=nonstrict")

scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.hadoop.hive.ql.metadata.HiveException:
Number of dynamic partitions created is 1001, which is more than 1000.
To solve this try to set hive.exec.max.dynamic.partitions to at least 1001.

scala> sql("set hive.exec.max.dynamic.partitions=1001")

scala> sql("set hive.exec.max.dynamic.partitions").show(false)
+--------------------------------+-----+
|key                             |value|
+--------------------------------+-----+
|hive.exec.max.dynamic.partitions|1001 |
+--------------------------------+-----+

scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.hadoop.hive.ql.metadata.HiveException:
Number of dynamic partitions created is 1001, which is more than 1000.
To solve this try to set hive.exec.max.dynamic.partitions to at least 1001.
{code}

The last error is the same as the previous one: `HiveClient` does not know about 
the new value 1001. There is no way to change the value of 
`hive.exec.max.dynamic.partitions` that `HiveClient` uses with the `SET` command.

The root cause is that `hive` parameters are passed to `HiveClient` only when it is 
created. So, the workaround is to use `--hiveconf` when starting `spark-shell`. 
However, the value still cannot be changed afterwards from within `spark-shell`. We 
had better handle this case without misleading error messages that send users into 
an endless loop.
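Because these parameters are only read when the `HiveClient` is created, they have to be supplied before the Hive-enabled session starts. A minimal sketch of that idea, assuming the usual `spark.hadoop.*` pass-through of Hadoop/Hive properties (the key prefix and the value 2000 are illustrative, not taken from this report):

{code}
// Sketch only: set the Hive parameter before the HiveClient exists
// (at session construction time) instead of via a later SET command.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.hadoop.hive.exec.max.dynamic.partitions", "2000") // illustrative value
  .getOrCreate()
{code}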

  was:
## What changes were proposed in this pull request?

Since Spark 2.0.0, `SET` commands do not pass the values to HiveClient. In most 
cases, Spark handles this well. However, for dynamic partition inserts, users run 
into the following misleading situation. 

{code}
scala> spark.range(1001).selectExpr("id as key", "id as value").registerTempTable("t1001")

scala> sql("create table p (value int) partitioned by (key int)").show

scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.spark.SparkException:
Dynamic partition strict mode requires at least one static partition column.
To turn this off set hive.exec.dynamic.partition.mode=nonstrict

scala> sql("set hive.exec.dynamic.partition.mode=nonstrict")

scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.hadoop.hive.ql.metadata.HiveException:
Number of dynamic partitions created is 1001, which is more than 1000.
To solve this try to set hive.exec.max.dynamic.partitions to at least 1001.

scala> sql("set hive.exec.max.dynamic.partitions=1001")

scala> sql("set hive.exec.max.dynamic.partitions").show(false)
+--------------------------------+-----+
|key                             |value|
+--------------------------------+-----+
|hive.exec.max.dynamic.partitions|1001 |
+--------------------------------+-----+

scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.hadoop.hive.ql.metadata.HiveException:
Number of dynamic partitions created is 1001, which is more than 1000.
To solve this try to set hive.exec.max.dynamic.partitions to at least 1001.
{code}

The last error is the same as the previous one: `HiveClient` does not know about 
the new value 1001. There is no way to change the value of 
`hive.exec.max.dynamic.partitions` that `HiveClient` uses with the `SET` command.

The root cause is that `hive` parameters are passed to `HiveClient` only when it is 
created. So, the workaround is to use `--hiveconf` when starting `spark-shell`. 
However, the value still cannot be changed afterwards from within `spark-shell`. We 
had better handle this case without misleading error messages that send users into 
an endless loop.


> Support Dynamic Partition Inserts params with SET command
> -
>
> Key: SPARK-19881
> URL: https://issues.apache.org/jira/browse/SPARK-19881
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 2.0.0, `SET` commands do not pass the values to HiveClient. In 
> most case, Spark handles well. However, for the dynamic 

[jira] [Updated] (SPARK-19881) Support Dynamic Partition Inserts params with SET command

2017-03-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-19881:
--
Description: 
## What changes were proposed in this pull request?

Since Spark 2.0.0, `SET` commands do not pass the values to HiveClient. In most 
cases, Spark handles this well. However, for dynamic partition inserts, users run 
into the following misleading situation. 

{code}
scala> spark.range(1001).selectExpr("id as key", "id as value").registerTempTable("t1001")

scala> sql("create table p (value int) partitioned by (key int)").show

scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.spark.SparkException:
Dynamic partition strict mode requires at least one static partition column.
To turn this off set hive.exec.dynamic.partition.mode=nonstrict

scala> sql("set hive.exec.dynamic.partition.mode=nonstrict")

scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.hadoop.hive.ql.metadata.HiveException:
Number of dynamic partitions created is 1001, which is more than 1000.
To solve this try to set hive.exec.max.dynamic.partitions to at least 1001.

scala> sql("set hive.exec.max.dynamic.partitions=1001")

scala> sql("set hive.exec.max.dynamic.partitions").show(false)
+--------------------------------+-----+
|key                             |value|
+--------------------------------+-----+
|hive.exec.max.dynamic.partitions|1001 |
+--------------------------------+-----+

scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.hadoop.hive.ql.metadata.HiveException:
Number of dynamic partitions created is 1001, which is more than 1000.
To solve this try to set hive.exec.max.dynamic.partitions to at least 1001.
{code}

The last error is the same as the previous one: `HiveClient` does not know about 
the new value 1001. There is no way to change the value of 
`hive.exec.max.dynamic.partitions` that `HiveClient` uses with the `SET` command.

The root cause is that `hive` parameters are passed to `HiveClient` only when it is 
created. So, the workaround is to use `--hiveconf` when starting `spark-shell`. 
However, the value still cannot be changed afterwards from within `spark-shell`. We 
had better handle this case without misleading error messages that send users into 
an endless loop.

  was:
Currently, `SET` command does not pass the values to Hive. In most case, Spark 
handles well. However, for the dynamic partition insert, users meet the 
following situation. 

{code}
scala> spark.range(1001).selectExpr("id as key", "id as 
value").registerTempTable("t1001")

scala> sql("create table p (value int) partitioned by (key int)").show

scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.spark.SparkException: Dynamic partition strict mode requires at 
least one static partition column. To turn this off set 
hive.exec.dynamic.partition.mode=nonstrict

scala> sql("set hive.exec.dynamic.partition.mode=nonstrict")

scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions 
created is 1001, which is more than 1000. To solve this try to set 
hive.exec.max.dynamic.partitions to at least 1001.

scala> sql("set hive.exec.dynamic.partition.mode=1001")

scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions 
created is 1001, which is more than 1000. To solve this try to set 
hive.exec.max.dynamic.partitions to at least 1001.

<== Repeat the same error message.
{code}

The root cause is that `hive` parameters are passed to `HiveClient` on 
creating. So, The workaround is using `--hiveconf`.

We had better handle this case without misleading error messages.


> Support Dynamic Partition Inserts params with SET command
> -
>
> Key: SPARK-19881
> URL: https://issues.apache.org/jira/browse/SPARK-19881
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> ## What changes were proposed in this pull request?
> Since Spark 2.0.0, `SET` commands do not pass the values to HiveClient. In 
> most case, Spark handles well. However, for the dynamic partition insert, 
> users meet the following misleading situation. 
> {code}
> scala> spark.range(1001).selectExpr("id as key", "id as 
> value").registerTempTable("t1001")
> scala> sql("create table p (value int) partitioned by (key int)").show
> scala> sql("insert into table p partition(key) select key, value from t1001")
> org.apache.spark.SparkException:
> Dynamic partition strict mode requires at least one static partition column.
> To turn this off set hive.exec.dynamic.partition.mode=nonstrict
> scala> 

[jira] [Assigned] (SPARK-19881) Support Dynamic Partition Inserts params with SET command

2017-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19881:


Assignee: (was: Apache Spark)

> Support Dynamic Partition Inserts params with SET command
> -
>
> Key: SPARK-19881
> URL: https://issues.apache.org/jira/browse/SPARK-19881
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Currently, `SET` command does not pass the values to Hive. In most case, 
> Spark handles well. However, for the dynamic partition insert, users meet the 
> following situation. 
> {code}
> scala> spark.range(1001).selectExpr("id as key", "id as 
> value").registerTempTable("t1001")
> scala> sql("create table p (value int) partitioned by (key int)").show
> scala> sql("insert into table p partition(key) select key, value from t1001")
> org.apache.spark.SparkException: Dynamic partition strict mode requires at 
> least one static partition column. To turn this off set 
> hive.exec.dynamic.partition.mode=nonstrict
> scala> sql("set hive.exec.dynamic.partition.mode=nonstrict")
> scala> sql("insert into table p partition(key) select key, value from t1001")
> org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic 
> partitions created is 1001, which is more than 1000. To solve this try to set 
> hive.exec.max.dynamic.partitions to at least 1001.
> scala> sql("set hive.exec.dynamic.partition.mode=1001")
> scala> sql("insert into table p partition(key) select key, value from t1001")
> org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic 
> partitions created is 1001, which is more than 1000. To solve this try to set 
> hive.exec.max.dynamic.partitions to at least 1001.
> <== Repeat the same error message.
> {code}
> The root cause is that `hive` parameters are passed to `HiveClient` on 
> creating. So, The workaround is using `--hiveconf`.
> We had better handle this case without misleading error messages.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19881) Support Dynamic Partition Inserts params with SET command

2017-03-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902614#comment-15902614
 ] 

Apache Spark commented on SPARK-19881:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/17223

> Support Dynamic Partition Inserts params with SET command
> -
>
> Key: SPARK-19881
> URL: https://issues.apache.org/jira/browse/SPARK-19881
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Currently, `SET` command does not pass the values to Hive. In most case, 
> Spark handles well. However, for the dynamic partition insert, users meet the 
> following situation. 
> {code}
> scala> spark.range(1001).selectExpr("id as key", "id as 
> value").registerTempTable("t1001")
> scala> sql("create table p (value int) partitioned by (key int)").show
> scala> sql("insert into table p partition(key) select key, value from t1001")
> org.apache.spark.SparkException: Dynamic partition strict mode requires at 
> least one static partition column. To turn this off set 
> hive.exec.dynamic.partition.mode=nonstrict
> scala> sql("set hive.exec.dynamic.partition.mode=nonstrict")
> scala> sql("insert into table p partition(key) select key, value from t1001")
> org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic 
> partitions created is 1001, which is more than 1000. To solve this try to set 
> hive.exec.max.dynamic.partitions to at least 1001.
> scala> sql("set hive.exec.dynamic.partition.mode=1001")
> scala> sql("insert into table p partition(key) select key, value from t1001")
> org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic 
> partitions created is 1001, which is more than 1000. To solve this try to set 
> hive.exec.max.dynamic.partitions to at least 1001.
> <== Repeat the same error message.
> {code}
> The root cause is that `hive` parameters are passed to `HiveClient` on 
> creating. So, The workaround is using `--hiveconf`.
> We had better handle this case without misleading error messages.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19881) Support Dynamic Partition Inserts params with SET command

2017-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19881:


Assignee: Apache Spark

> Support Dynamic Partition Inserts params with SET command
> -
>
> Key: SPARK-19881
> URL: https://issues.apache.org/jira/browse/SPARK-19881
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, `SET` command does not pass the values to Hive. In most case, 
> Spark handles well. However, for the dynamic partition insert, users meet the 
> following situation. 
> {code}
> scala> spark.range(1001).selectExpr("id as key", "id as 
> value").registerTempTable("t1001")
> scala> sql("create table p (value int) partitioned by (key int)").show
> scala> sql("insert into table p partition(key) select key, value from t1001")
> org.apache.spark.SparkException: Dynamic partition strict mode requires at 
> least one static partition column. To turn this off set 
> hive.exec.dynamic.partition.mode=nonstrict
> scala> sql("set hive.exec.dynamic.partition.mode=nonstrict")
> scala> sql("insert into table p partition(key) select key, value from t1001")
> org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic 
> partitions created is 1001, which is more than 1000. To solve this try to set 
> hive.exec.max.dynamic.partitions to at least 1001.
> scala> sql("set hive.exec.dynamic.partition.mode=1001")
> scala> sql("insert into table p partition(key) select key, value from t1001")
> org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic 
> partitions created is 1001, which is more than 1000. To solve this try to set 
> hive.exec.max.dynamic.partitions to at least 1001.
> <== Repeat the same error message.
> {code}
> The root cause is that `hive` parameters are passed to `HiveClient` on 
> creating. So, The workaround is using `--hiveconf`.
> We had better handle this case without misleading error messages.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19881) Support Dynamic Partition Inserts params with SET command

2017-03-08 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-19881:
-

 Summary: Support Dynamic Partition Inserts params with SET command
 Key: SPARK-19881
 URL: https://issues.apache.org/jira/browse/SPARK-19881
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.0
Reporter: Dongjoon Hyun
Priority: Minor


Currently, `SET` command does not pass the values to Hive. In most case, Spark 
handles well. However, for the dynamic partition insert, users meet the 
following situation. 

{code}
scala> spark.range(1001).selectExpr("id as key", "id as 
value").registerTempTable("t1001")

scala> sql("create table p (value int) partitioned by (key int)").show

scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.spark.SparkException: Dynamic partition strict mode requires at 
least one static partition column. To turn this off set 
hive.exec.dynamic.partition.mode=nonstrict

scala> sql("set hive.exec.dynamic.partition.mode=nonstrict")

scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions 
created is 1001, which is more than 1000. To solve this try to set 
hive.exec.max.dynamic.partitions to at least 1001.

scala> sql("set hive.exec.dynamic.partition.mode=1001")

scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions 
created is 1001, which is more than 1000. To solve this try to set 
hive.exec.max.dynamic.partitions to at least 1001.

<== Repeat the same error message.
{code}

The root cause is that `hive` parameters are passed to `HiveClient` on 
creating. So, The workaround is using `--hiveconf`.

We had better handle this case without misleading error messages.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19439) PySpark's registerJavaFunction Should Support UDAFs

2017-03-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902601#comment-15902601
 ] 

Apache Spark commented on SPARK-19439:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17222

> PySpark's registerJavaFunction Should Support UDAFs
> ---
>
> Key: SPARK-19439
> URL: https://issues.apache.org/jira/browse/SPARK-19439
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Keith Bourgoin
>
> When trying to import a Scala UDAF using registerJavaFunction, I get this 
> error:
> {code}
> In [1]: sqlContext.registerJavaFunction('geo_mean', 
> 'com.foo.bar.GeometricMean')
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 sqlContext.registerJavaFunction('geo_mean', 
> 'com.foo.bar.GeometricMean')
> /home/kfb/src/projects/spark/python/pyspark/sql/context.pyc in 
> registerJavaFunction(self, name, javaClassName, returnType)
> 227 if returnType is not None:
> 228 jdt = 
> self.sparkSession._jsparkSession.parseDataType(returnType.json())
> --> 229 self.sparkSession._jsparkSession.udf().registerJava(name, 
> javaClassName, jdt)
> 230
> 231 # TODO(andrew): delete this once we refactor things to take in 
> SparkSession
> /home/kfb/src/projects/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134
>1135 for temp_arg in temp_args:
> /home/kfb/src/projects/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /home/kfb/src/projects/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py 
> in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o28.registerJava.
> : java.io.IOException: UDF class com.foo.bar.GeometricMean doesn't implement 
> any UDF interface
>   at 
> org.apache.spark.sql.UDFRegistration.registerJava(UDFRegistration.scala:438)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> According to SPARK-10915, UDAFs in Python aren't happening anytime soon. 
> Without this, there's no way to get Scala UDAFs into Python Spark SQL 
> whatsoever. Fixing that would be a huge help so that we can keep aggregations 
> in the JVM and keep using DataFrames. Otherwise, all our code has to drop down 
> to RDDs and live in Python.
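For context, a class registered through `registerJava` has to implement one of the UDF interfaces, which an aggregate function does not, hence the error above. A minimal sketch of the kind of Scala UDAF the report refers to (the real `com.foo.bar.GeometricMean` is not shown in the issue, so everything below is an assumption), written against the Spark 2.1-era `UserDefinedAggregateFunction` API:

{code}
package com.foo.bar

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Geometric mean as a UDAF: track the count and the running product,
// then take the count-th root at evaluation time.
class GeometricMean extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType =
    StructType(StructField("count", LongType) :: StructField("product", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 1.0
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getLong(0) + 1
    buffer(1) = buffer.getDouble(1) * input.getDouble(0)
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getDouble(1) * buffer2.getDouble(1)
  }
  def evaluate(buffer: Row): Any =
    math.pow(buffer.getDouble(1), 1.0 / buffer.getLong(0))
}
{code}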



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19439) PySpark's registerJavaFunction Should Support UDAFs

2017-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19439:


Assignee: Apache Spark

> PySpark's registerJavaFunction Should Support UDAFs
> ---
>
> Key: SPARK-19439
> URL: https://issues.apache.org/jira/browse/SPARK-19439
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Keith Bourgoin
>Assignee: Apache Spark
>
> When trying to import a Scala UDAF using registerJavaFunction, I get this 
> error:
> {code}
> In [1]: sqlContext.registerJavaFunction('geo_mean', 
> 'com.foo.bar.GeometricMean')
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 sqlContext.registerJavaFunction('geo_mean', 
> 'com.foo.bar.GeometricMean')
> /home/kfb/src/projects/spark/python/pyspark/sql/context.pyc in 
> registerJavaFunction(self, name, javaClassName, returnType)
> 227 if returnType is not None:
> 228 jdt = 
> self.sparkSession._jsparkSession.parseDataType(returnType.json())
> --> 229 self.sparkSession._jsparkSession.udf().registerJava(name, 
> javaClassName, jdt)
> 230
> 231 # TODO(andrew): delete this once we refactor things to take in 
> SparkSession
> /home/kfb/src/projects/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134
>1135 for temp_arg in temp_args:
> /home/kfb/src/projects/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /home/kfb/src/projects/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py 
> in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o28.registerJava.
> : java.io.IOException: UDF class com.foo.bar.GeometricMean doesn't implement 
> any UDF interface
>   at 
> org.apache.spark.sql.UDFRegistration.registerJava(UDFRegistration.scala:438)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> According to SPARK-10915, UDAFs in Python aren't happening anytime soon. 
> Without this, there's no way to get Scala UDAFs into Python Spark SQL 
> whatsoever. Fixing that would be a huge help so that we can keep aggregations 
> in the JVM and keep using DataFrames. Otherwise, all our code has to drop down 
> to RDDs and live in Python.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19439) PySpark's registerJavaFunction Should Support UDAFs

2017-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19439:


Assignee: (was: Apache Spark)

> PySpark's registerJavaFunction Should Support UDAFs
> ---
>
> Key: SPARK-19439
> URL: https://issues.apache.org/jira/browse/SPARK-19439
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Keith Bourgoin
>
> When trying to import a Scala UDAF using registerJavaFunction, I get this 
> error:
> {code}
> In [1]: sqlContext.registerJavaFunction('geo_mean', 
> 'com.foo.bar.GeometricMean')
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 sqlContext.registerJavaFunction('geo_mean', 
> 'com.foo.bar.GeometricMean')
> /home/kfb/src/projects/spark/python/pyspark/sql/context.pyc in 
> registerJavaFunction(self, name, javaClassName, returnType)
> 227 if returnType is not None:
> 228 jdt = 
> self.sparkSession._jsparkSession.parseDataType(returnType.json())
> --> 229 self.sparkSession._jsparkSession.udf().registerJava(name, 
> javaClassName, jdt)
> 230
> 231 # TODO(andrew): delete this once we refactor things to take in 
> SparkSession
> /home/kfb/src/projects/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134
>1135 for temp_arg in temp_args:
> /home/kfb/src/projects/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /home/kfb/src/projects/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py 
> in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o28.registerJava.
> : java.io.IOException: UDF class com.foo.bar.GeometricMean doesn't implement 
> any UDF interface
>   at 
> org.apache.spark.sql.UDFRegistration.registerJava(UDFRegistration.scala:438)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> According to SPARK-10915, UDAFs in Python aren't happening anytime soon. 
> Without this, there's no way to get Scala UDAFs into Python Spark SQL 
> whatsoever. Fixing that would be a huge help so that we can keep aggregations 
> in the JVM and keep using DataFrames. Otherwise, all our code has to drop down 
> to RDDs and live in Python.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19874) Hide API docs for "org.apache.spark.sql.internal"

2017-03-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19874.
--
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> Hide API docs for "org.apache.spark.sql.internal"
> -
>
> Key: SPARK-19874
> URL: https://issues.apache.org/jira/browse/SPARK-19874
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
>
> The API docs should not include the "org.apache.spark.sql.internal" package 
> because they are internal private APIs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19874) Hide API docs for "org.apache.spark.sql.internal"

2017-03-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19874:
-
Priority: Minor  (was: Major)

> Hide API docs for "org.apache.spark.sql.internal"
> -
>
> Key: SPARK-19874
> URL: https://issues.apache.org/jira/browse/SPARK-19874
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>
> The API docs should not include the "org.apache.spark.sql.internal" package 
> because they are internal private APIs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19235) Enable Test Cases in DDLSuite with Hive Metastore

2017-03-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19235.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16592
[https://github.com/apache/spark/pull/16592]

> Enable Test Cases in DDLSuite with Hive Metastore
> -
>
> Key: SPARK-19235
> URL: https://issues.apache.org/jira/browse/SPARK-19235
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> So far, the test cases in DDLSuites only verify the behaviors of 
> InMemoryCatalog. That means, they do not cover the scenarios using 
> HiveExternalCatalog. Thus, we need to improve the existing test suite to run 
> these cases using Hive metastore.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11141) Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes

2017-03-08 Thread Jim Kleckner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902537#comment-15902537
 ] 

Jim Kleckner edited comment on SPARK-11141 at 3/9/17 5:54 AM:
--

FYI, this can cause problems when not using S3 during shutdown as described in 
this AWS posting: https://forums.aws.amazon.com/thread.jspa?threadID=223378

The workaround indicated is to use --conf 
spark.streaming.driver.writeAheadLog.allowBatching=false with the submit.

The exception contains the text:
{code}
streaming stop ReceivedBlockTracker: Exception thrown while writing record: 
BatchAllocationEvent
{code}
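For reference, the same workaround can also be applied in code rather than on the spark-submit command line; a small sketch (the configuration key comes from the comment above, while the application name and batch interval are illustrative):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Disable WAL batching on the driver, as suggested above, before the
// StreamingContext (and thus the ReceivedBlockTracker) is created.
val conf = new SparkConf()
  .setAppName("wal-batching-workaround")
  .set("spark.streaming.driver.writeAheadLog.allowBatching", "false")
val ssc = new StreamingContext(conf, Seconds(10))
{code}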


was (Author: jkleckner):
FYI, this can cause problems when not using S3 during shutdown as described in 
this AWS posting: https://forums.aws.amazon.com/thread.jspa?threadID=223378

The workaround indicated is to use --conf 
spark.streaming.driver.writeAheadLog.allowBatching=false with the submit.

> Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes
> --
>
> Key: SPARK-11141
> URL: https://issues.apache.org/jira/browse/SPARK-11141
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 1.6.0
>
>
> When using S3 as a directory for WALs, the writes take too long. The driver 
> gets very easily bottlenecked when multiple receivers send AddBlock events to 
> the ReceiverTracker. This PR adds batching of events in the 
> ReceivedBlockTracker so that receivers don't get blocked by the driver for 
> too long.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11141) Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes

2017-03-08 Thread Jim Kleckner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902537#comment-15902537
 ] 

Jim Kleckner commented on SPARK-11141:
--

FYI, this can cause problems when not using S3 during shutdown as described in 
this AWS posting: https://forums.aws.amazon.com/thread.jspa?threadID=223378

The workaround indicated is to use --conf 
spark.streaming.driver.writeAheadLog.allowBatching=false with the submit.

> Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes
> --
>
> Key: SPARK-11141
> URL: https://issues.apache.org/jira/browse/SPARK-11141
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 1.6.0
>
>
> When using S3 as a directory for WALs, the writes take too long. The driver 
> gets very easily bottlenecked when multiple receivers send AddBlock events to 
> the ReceiverTracker. This PR adds batching of events in the 
> ReceivedBlockTracker so that receivers don't get blocked by the driver for 
> too long.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19866) Add local version of Word2Vec findSynonyms for spark.ml: Python API

2017-03-08 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902522#comment-15902522
 ] 

Xin Ren commented on SPARK-19866:
-

I can try this one :)

> Add local version of Word2Vec findSynonyms for spark.ml: Python API
> ---
>
> Key: SPARK-19866
> URL: https://issues.apache.org/jira/browse/SPARK-19866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Add Python API for findSynonymsArray matching Scala API in linked JIRA.
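For reference, a sketch of the existing Scala API that the Python wrapper would mirror (this assumes an already-fitted `org.apache.spark.ml.feature.Word2VecModel` named `model`; the query word and result count are illustrative):

{code}
// findSynonymsArray returns the synonyms locally as an Array instead of a DataFrame.
val synonyms: Array[(String, Double)] = model.findSynonymsArray("spark", 5)
synonyms.foreach { case (word, similarity) => println(s"$word: $similarity") }
{code}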



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19880) About spark2.0.2 and spark1.4.1 beeline to show the database, use the default operation such as dealing with different

2017-03-08 Thread guoxiaolong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolong updated SPARK-19880:

Description: 
Spark 2.0.2 and Spark 1.4.1 handle beeline operations such as `show databases` and 
`use default` differently. Why do `show databases` and `use default` need to 
execute a job in Spark 2.0.2? When a job with many tasks runs for a long time (for 
example a query), subsequent `show databases` and `use default` operations have to 
wait in line behind it. In Spark 1.4.1, even when such a long-running job is in 
progress, these operations do not need to wait.

  was:
Spark 2.0.2 and Spark 1.4.1 handle beeline operations such as `show databases` and 
`use default` differently. Why do `show databases` and `use default` need to 
execute a job in Spark 2.0.2? When a job with many tasks runs for a long time (for 
example a query), subsequent `show databases` and `use default` operations have to 
wait in line behind it.


> About spark2.0.2 and spark1.4.1 beeline to show the database, use the default 
> operation such as dealing with different
> --
>
> Key: SPARK-19880
> URL: https://issues.apache.org/jira/browse/SPARK-19880
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: guoxiaolong
>
> About spark2.0.2 and spark1.4.1 beeline to show the database, use the default 
> operation such as dealing with different
> .why show databases,use default such operation need execute  a job in  
> spark2.0.2 .When a job task  is very much, time is very long,such as query 
> operation, lead to the back of the show databases, use the default operations 
> such as waiting in line.But When a job task  is very much, time is very 
> long,such as query operation, lead to the back of the show databases, use the 
> default operations no need to wait



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19880) About spark2.0.2 and spark1.4.1 beeline to show the database, use the default operation such as dealing with different

2017-03-08 Thread guoxiaolong (JIRA)
guoxiaolong created SPARK-19880:
---

 Summary: About spark2.0.2 and spark1.4.1 beeline to show the 
database, use the default operation such as dealing with different
 Key: SPARK-19880
 URL: https://issues.apache.org/jira/browse/SPARK-19880
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: guoxiaolong


Spark 2.0.2 and Spark 1.4.1 handle beeline operations such as `show databases` and 
`use default` differently. Why do `show databases` and `use default` need to 
execute a job in Spark 2.0.2? When a job with many tasks runs for a long time (for 
example a query), subsequent `show databases` and `use default` operations have to 
wait in line behind it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19862) In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted.

2017-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19862:


Assignee: (was: Apache Spark)

> In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted. 
> -
>
> Key: SPARK-19862
> URL: https://issues.apache.org/jira/browse/SPARK-19862
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: guoxiaolong
>Priority: Trivial
>
> "tungsten-sort" -> 
> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName can be 
> deleted. Because it is the same of "sort" -> 
> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName.
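For reference, a sketch of the mapping being discussed, as it appears in SparkEnv.scala around Spark 2.1; both aliases already resolve to the same class, which is the redundancy the issue points out:

{code}
val shortShuffleMgrNames = Map(
  "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,
  "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)
{code}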



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19862) In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted.

2017-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19862:


Assignee: Apache Spark

> In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted. 
> -
>
> Key: SPARK-19862
> URL: https://issues.apache.org/jira/browse/SPARK-19862
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: guoxiaolong
>Assignee: Apache Spark
>Priority: Trivial
>
> "tungsten-sort" -> 
> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName can be 
> deleted. Because it is the same of "sort" -> 
> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19862) In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted.

2017-03-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902407#comment-15902407
 ] 

Apache Spark commented on SPARK-19862:
--

User 'guoxiaolongzte' has created a pull request for this issue:
https://github.com/apache/spark/pull/17220

> In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted. 
> -
>
> Key: SPARK-19862
> URL: https://issues.apache.org/jira/browse/SPARK-19862
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: guoxiaolong
>Priority: Trivial
>
> "tungsten-sort" -> 
> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName can be 
> deleted. Because it is the same of "sort" -> 
> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19878) Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala

2017-03-08 Thread kavn qin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kavn qin updated SPARK-19878:
-
External issue URL:   (was: 
https://issues.apache.org/jira/browse/SPARK-17920)
 External issue ID:   (was: SPARK-17920)
Issue Type: Improvement  (was: Wish)

> Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
> --
>
> Key: SPARK-19878
> URL: https://issues.apache.org/jira/browse/SPARK-19878
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0, 1.6.0, 2.0.0
> Environment: Centos 6.5: Hadoop 2.6.0, Spark 1.5.0, Hive 1.1.0
>Reporter: kavn qin
>Priority: Minor
>  Labels: patch
> Attachments: SPARK-19878.patch
>
>
> When case class InsertIntoHiveTable intializes a serde it explicitly passes 
> null for the Configuration in Spark 1.5.0:
> [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58]
> While in Spark 2.0.0, the HiveWriterContainer intializes a serde it also just 
> passes null for the Configuration:
> [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161]
> When we implement a hive serde, we want to use the hive configuration to  get 
> some static and dynamic settings, but we can not do it !
> So this patch add the configuration when initialize hive serde.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19878) Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala

2017-03-08 Thread kavn qin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kavn qin updated SPARK-19878:
-
Attachment: SPARK-19878.patch

> Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
> --
>
> Key: SPARK-19878
> URL: https://issues.apache.org/jira/browse/SPARK-19878
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 1.5.0, 1.6.0, 2.0.0
> Environment: Centos 6.5: Hadoop 2.6.0, Spark 1.5.0, Hive 1.1.0
>Reporter: kavn qin
>Priority: Minor
>  Labels: patch
> Attachments: SPARK-19878.patch
>
>
> When case class InsertIntoHiveTable intializes a serde it explicitly passes 
> null for the Configuration in Spark 1.5.0:
> [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58]
> While in Spark 2.0.0, the HiveWriterContainer intializes a serde it also just 
> passes null for the Configuration:
> [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161]
> When we implement a hive serde, we want to use the hive configuration to  get 
> some static and dynamic settings, but we can not do it !
> So this patch add the configuration when initialize hive serde.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19878) Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala

2017-03-08 Thread kavn qin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kavn qin updated SPARK-19878:
-
Attachment: (was: SPARK-19878.patch)

> Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
> --
>
> Key: SPARK-19878
> URL: https://issues.apache.org/jira/browse/SPARK-19878
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 1.5.0, 1.6.0, 2.0.0
> Environment: Centos 6.5: Hadoop 2.6.0, Spark 1.5.0, Hive 1.1.0
>Reporter: kavn qin
>Priority: Minor
>  Labels: patch
>
> When case class InsertIntoHiveTable intializes a serde it explicitly passes 
> null for the Configuration in Spark 1.5.0:
> [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58]
> While in Spark 2.0.0, the HiveWriterContainer intializes a serde it also just 
> passes null for the Configuration:
> [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161]
> When we implement a hive serde, we want to use the hive configuration to  get 
> some static and dynamic settings, but we can not do it !
> So this patch add the configuration when initialize hive serde.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19878) Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala

2017-03-08 Thread kavn qin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kavn qin updated SPARK-19878:
-
Description: 
When the case class InsertIntoHiveTable initializes a serde, it explicitly passes 
null for the Configuration in Spark 1.5.0:

[https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58]

Similarly, when the HiveWriterContainer in Spark 2.0.0 initializes a serde, it also 
just passes null for the Configuration:

[https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161]

When we implement a Hive serde, we want to use the Hive configuration to get some 
static and dynamic settings, but we cannot do so.

So this patch adds the configuration when initializing the Hive serde.
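A hedged sketch of the kind of change described, modeled on the `newSerializer` helper in hiveWriterContainers.scala (the method shape and the parameter name `jobConf` are illustrative, not the exact code in the patch): hand the serde the job's Hadoop/Hive configuration instead of null.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.ql.plan.TableDesc
import org.apache.hadoop.hive.serde2.Serializer

// Before: serializer.initialize(null, tableDesc.getProperties)
// After (sketch): pass the actual configuration so the serde can read
// static and dynamic settings from it.
def newSerializer(tableDesc: TableDesc, jobConf: Configuration): Serializer = {
  val serializer = tableDesc.getDeserializerClass.newInstance().asInstanceOf[Serializer]
  serializer.initialize(jobConf, tableDesc.getProperties)
  serializer
}
{code}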

  was:
When the case class InsertIntoHiveTable initializes a serde, it explicitly passes 
null for the Configuration in Spark 1.5.0:

[https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58]

Similarly, when the HiveWriterContainer in Spark 2.0.0 initializes a serde, it also 
just passes null for the Configuration:

[https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161]

When we implement a Hive serde, we want to use the Hive configuration to get some 
static and dynamic settings, but we cannot do so.

This patch just adds the configuration when initializing the Hive serde.


> Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
> --
>
> Key: SPARK-19878
> URL: https://issues.apache.org/jira/browse/SPARK-19878
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 1.5.0, 1.6.0, 2.0.0
> Environment: Centos 6.5: Hadoop 2.6.0, Spark 1.5.0, Hive 1.1.0
>Reporter: kavn qin
>Priority: Minor
>  Labels: patch
> Attachments: SPARK-19878.patch
>
>
> When case class InsertIntoHiveTable intializes a serde it explicitly passes 
> null for the Configuration in Spark 1.5.0:
> [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58]
> While in Spark 2.0.0, the HiveWriterContainer intializes a serde it also just 
> passes null for the Configuration:
> [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161]
> When we implement a hive serde, we want to use the hive configuration to  get 
> some static and dynamic settings, but we can not do it !
> So this patch add the configuration when initialize hive serde.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-12180) DataFrame.join() in PySpark gives misleading exception when column name exists on both side

2017-03-08 Thread Abhishek Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Kumar updated SPARK-12180:
---
Comment: was deleted

(was: Is there any concrete solution or reason explaining the issue ? I am 
facing this in Spark 1.6.2 when I try to join two data-frames derived from same 
source data-frame.)

> DataFrame.join() in PySpark gives misleading exception when column name 
> exists on both side
> ---
>
> Key: SPARK-12180
> URL: https://issues.apache.org/jira/browse/SPARK-12180
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Daniel Thomas
>
> When joining two DataFrames on a column 'session_uuid' I got the following 
> exception, because both DataFrames had a column called 'at'. The exception is 
> misleading about the cause and about the column causing the problem. Renaming 
> the column fixed the exception.
> ---
> Py4JJavaError Traceback (most recent call last)
> /Applications/spark-1.5.2-bin-hadoop2.4/python/pyspark/sql/utils.py in 
> deco(*a, **kw)
>  35 try:
> ---> 36 return f(*a, **kw)
>  37 except py4j.protocol.Py4JJavaError as e:
> /Applications/spark-1.5.2-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> Py4JJavaError: An error occurred while calling o484.join.
> : org.apache.spark.sql.AnalysisException: resolved attribute(s) 
> session_uuid#3278 missing from 
> uuid_x#9078,total_session_sec#9115L,at#3248,session_uuid#9114,uuid#9117,at#9084
>  in operator !Join Inner, Some((uuid_x#9078 = session_uuid#3278));
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> AnalysisException Traceback (most recent call last)
>  in ()
>   1 sel_starts = starts.select('uuid', 'at').withColumnRenamed('uuid', 
> 'uuid_x')#.withColumnRenamed('at', 'at_x')
>   2 sel_closes = closes.select('uuid', 'at', 'session_uuid', 
> 'total_session_sec')
> > 3 start_close = sel_starts.join(sel_closes, sel_starts['uuid_x'] == 
> sel_closes['session_uuid'])
>   4 start_close.cache()
>   5 start_close.take(1)
> /Applications/spark-1.5.2-bin-hadoop2.4/python/pyspark/sql/dataframe.py in 
> join(self, other, on, how)
> 579 on = on[0]
> 580 if how is None:
> --> 581 jdf = self._jdf.join(other._jdf, on._jc, "inner")
> 582 else:
> 583 assert isinstance(how, basestring), "how should be 
> basestring"
> 
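The rename workaround mentioned in the description, sketched in Scala for consistency with the other snippets here (the DataFrame names `starts` and `closes` come from the traceback; the rest is an assumption):

{code}
// Rename the clashing 'at' column (and 'uuid') on one side before joining,
// so the analyzer no longer sees duplicate attributes on both sides.
val selStarts = starts.select("uuid", "at")
  .withColumnRenamed("uuid", "uuid_x")
  .withColumnRenamed("at", "at_x")
val selCloses = closes.select("uuid", "at", "session_uuid", "total_session_sec")
val startClose = selStarts.join(selCloses, selStarts("uuid_x") === selCloses("session_uuid"))
{code}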

[jira] [Updated] (SPARK-19878) Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala

2017-03-08 Thread kavn qin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kavn qin updated SPARK-19878:
-
Attachment: SPARK-19878.patch

> Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
> --
>
> Key: SPARK-19878
> URL: https://issues.apache.org/jira/browse/SPARK-19878
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 1.5.0, 1.6.0, 2.0.0
> Environment: Centos 6.5: Hadoop 2.6.0, Spark 1.5.0, Hive 1.1.0
>Reporter: kavn qin
>Priority: Minor
>  Labels: patch
> Attachments: SPARK-19878.patch
>
>
> When case class InsertIntoHiveTable initializes a serde, it explicitly passes 
> null for the Configuration in Spark 1.5.0:
> [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58]
> Likewise, in Spark 2.0.0, when the HiveWriterContainer initializes a serde it also 
> just passes null for the Configuration:
> [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161]
> When we implement a hive serde and want to use the hive configuration to get 
> some static and dynamic settings, we cannot do it!
> This patch just adds the configuration when initializing the hive serde.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19879) Spark UI table sort breaks event timeline

2017-03-08 Thread Sebastian Estevez (JIRA)
Sebastian Estevez created SPARK-19879:
-

 Summary: Spark UI table sort breaks event timeline
 Key: SPARK-19879
 URL: https://issues.apache.org/jira/browse/SPARK-19879
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.6.3
Reporter: Sebastian Estevez
Priority: Minor


When you sort the task list at the bottom of the page, the new sort order 
seems to get carried into the event timeline visualization, placing tasks in the 
wrong hosts and across the wrong time periods.

This is especially evident if a stage has a different number of tasks running 
per executor, or if there are some outlier long-running tasks in one or a few of 
the executors.

http://:4040/stages/stage/?id=0=0

http://:4040/stages/stage/?id=0=0=Duration=true=100



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19878) Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala

2017-03-08 Thread kavn qin (JIRA)
kavn qin created SPARK-19878:


 Summary: Add hive configuration when initialize hive serde in 
InsertIntoHiveTable.scala
 Key: SPARK-19878
 URL: https://issues.apache.org/jira/browse/SPARK-19878
 Project: Spark
  Issue Type: Wish
  Components: SQL
Affects Versions: 2.0.0, 1.6.0, 1.5.0
 Environment: Centos 6.5: Hadoop 2.6.0, Spark 1.5.0, Hive 1.1.0
Reporter: kavn qin
Priority: Minor


When case class InsertIntoHiveTable initializes a serde, it explicitly passes 
null for the Configuration in Spark 1.5.0:

[https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58]

Likewise, in Spark 2.0.0, when the HiveWriterContainer initializes a serde it also 
just passes null for the Configuration:

[https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161]

When we implement a hive serde and want to use the hive configuration to get 
some static and dynamic settings, we cannot do it!

This patch just adds the configuration when initializing the hive serde.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12180) DataFrame.join() in PySpark gives misleading exception when column name exists on both side

2017-03-08 Thread Abhishek Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902389#comment-15902389
 ] 

Abhishek Kumar edited comment on SPARK-12180 at 3/9/17 2:52 AM:


Is there any concrete solution or reason explaining the issue? I am facing 
this in Spark 1.6.2 when I try to join two data-frames derived from the same source 
data-frame.


was (Author: abhikumar):
Is there any concrete solution or reason explaining the issue ? I have facing 
this in Spark 1.6.2 when I try to join two data-frames derived from same source 
data-frame.

> DataFrame.join() in PySpark gives misleading exception when column name 
> exists on both side
> ---
>
> Key: SPARK-12180
> URL: https://issues.apache.org/jira/browse/SPARK-12180
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Daniel Thomas
>
> When joining two DataFrames on a column 'session_uuid' I got the following 
> exception, because both DataFrames had a column called 'at'. The exception is 
> misleading about the cause and about the column causing the problem. Renaming the 
> column fixed the exception.
> ---
> Py4JJavaError Traceback (most recent call last)
> /Applications/spark-1.5.2-bin-hadoop2.4/python/pyspark/sql/utils.py in 
> deco(*a, **kw)
>  35 try:
> ---> 36 return f(*a, **kw)
>  37 except py4j.protocol.Py4JJavaError as e:
> /Applications/spark-1.5.2-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> Py4JJavaError: An error occurred while calling o484.join.
> : org.apache.spark.sql.AnalysisException: resolved attribute(s) 
> session_uuid#3278 missing from 
> uuid_x#9078,total_session_sec#9115L,at#3248,session_uuid#9114,uuid#9117,at#9084
>  in operator !Join Inner, Some((uuid_x#9078 = session_uuid#3278));
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> AnalysisException Traceback (most recent call last)
>  in ()
>   1 sel_starts = starts.select('uuid', 'at').withColumnRenamed('uuid', 
> 'uuid_x')#.withColumnRenamed('at', 'at_x')
>   2 sel_closes = closes.select('uuid', 'at', 'session_uuid', 
> 'total_session_sec')
> > 3 start_close = sel_starts.join(sel_closes, sel_starts['uuid_x'] == 
> sel_closes['session_uuid'])
>   4 start_close.cache()
>   5 start_close.take(1)
> /Applications/spark-1.5.2-bin-hadoop2.4/python/pyspark/sql/dataframe.py in 
> join(self, other, on, how)
> 579 on = on[0]
> 580 if how is None:
> --> 

[jira] [Commented] (SPARK-12180) DataFrame.join() in PySpark gives misleading exception when column name exists on both side

2017-03-08 Thread Abhishek Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902389#comment-15902389
 ] 

Abhishek Kumar commented on SPARK-12180:


Is there any concrete solution or reason explaining the issue? I am facing 
this in Spark 1.6.2 when I try to join two data-frames derived from the same source 
data-frame.

> DataFrame.join() in PySpark gives misleading exception when column name 
> exists on both side
> ---
>
> Key: SPARK-12180
> URL: https://issues.apache.org/jira/browse/SPARK-12180
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Daniel Thomas
>
> When joining two DataFrames on a column 'session_uuid' I got the following 
> exception, because both DataFrames had a column called 'at'. The exception is 
> misleading about the cause and about the column causing the problem. Renaming the 
> column fixed the exception.
> ---
> Py4JJavaError Traceback (most recent call last)
> /Applications/spark-1.5.2-bin-hadoop2.4/python/pyspark/sql/utils.py in 
> deco(*a, **kw)
>  35 try:
> ---> 36 return f(*a, **kw)
>  37 except py4j.protocol.Py4JJavaError as e:
> /Applications/spark-1.5.2-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> Py4JJavaError: An error occurred while calling o484.join.
> : org.apache.spark.sql.AnalysisException: resolved attribute(s) 
> session_uuid#3278 missing from 
> uuid_x#9078,total_session_sec#9115L,at#3248,session_uuid#9114,uuid#9117,at#9084
>  in operator !Join Inner, Some((uuid_x#9078 = session_uuid#3278));
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> AnalysisException Traceback (most recent call last)
>  in ()
>   1 sel_starts = starts.select('uuid', 'at').withColumnRenamed('uuid', 
> 'uuid_x')#.withColumnRenamed('at', 'at_x')
>   2 sel_closes = closes.select('uuid', 'at', 'session_uuid', 
> 'total_session_sec')
> > 3 start_close = sel_starts.join(sel_closes, sel_starts['uuid_x'] == 
> sel_closes['session_uuid'])
>   4 start_close.cache()
>   5 start_close.take(1)
> /Applications/spark-1.5.2-bin-hadoop2.4/python/pyspark/sql/dataframe.py in 
> join(self, other, on, how)
> 579 on = on[0]
> 580 if how is None:
> --> 581 jdf = self._jdf.join(other._jdf, on._jc, "inner")
> 582 else:
> 583 assert isinstance(how, basestring), "how should be 
> basestring"
> 
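A minimal PySpark sketch of the rename workaround mentioned in the description above; the DataFrame and column names are taken from the snippet in the traceback, and the extra rename of the clashing 'at' column is an assumption about the fix the reporter applied.

{code}
# Rename the ambiguous 'at' column on one side before joining.
# 'starts' and 'closes' are assumed to be pre-existing DataFrames.
sel_starts = (starts
              .select('uuid', 'at')
              .withColumnRenamed('uuid', 'uuid_x')
              .withColumnRenamed('at', 'at_x'))   # rename the clashing column too
sel_closes = closes.select('uuid', 'at', 'session_uuid', 'total_session_sec')

start_close = sel_starts.join(sel_closes,
                              sel_starts['uuid_x'] == sel_closes['session_uuid'])
start_close.take(1)
{code}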

[jira] [Commented] (SPARK-19859) The new watermark should override the old one

2017-03-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902368#comment-15902368
 ] 

Apache Spark commented on SPARK-19859:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17221

> The new watermark should override the old one
> -
>
> Key: SPARK-19859
> URL: https://issues.apache.org/jira/browse/SPARK-19859
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
>
> The new watermark should override the old one. Otherwise, we just pick up the 
> first column that has a watermark, which may be unexpected.
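A minimal sketch of what the fixed behaviour looks like from the API side; 'events' and its 'eventTime' column are assumed names for a streaming DataFrame.

{code}
# With the change above, the most recent withWatermark call should win
# instead of the first one being kept silently.
watermarked = (events
               .withWatermark("eventTime", "1 hour")
               .withWatermark("eventTime", "10 minutes"))  # this delay should take effect
{code}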



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19877) Restrict the depth of view reference chains

2017-03-08 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-19877:


 Summary: Restrict the depth of view reference chains
 Key: SPARK-19877
 URL: https://issues.apache.org/jira/browse/SPARK-19877
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.2.0
Reporter: Jiang Xingbo


We should restrict the depth of view reference chains, to avoid stack 
overflow exceptions during resolution of nested views.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19877) Restrict the depth of view reference chains

2017-03-08 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902324#comment-15902324
 ] 

Jiang Xingbo commented on SPARK-19877:
--

I'm working on this.

> Restrict the depth of view reference chains
> ---
>
> Key: SPARK-19877
> URL: https://issues.apache.org/jira/browse/SPARK-19877
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jiang Xingbo
>
> We should restrict the depth of view reference chains, to avoid stack 
> overflow exceptions during resolution of nested views.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19808) About the default blocking arg in unpersist

2017-03-08 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902317#comment-15902317
 ] 

zhengruifeng commented on SPARK-19808:
--

[~srowen] Agreed. Changing the default may cause latent issues. I will close 
this jira.

> About the default blocking arg in unpersist
> ---
>
> Key: SPARK-19808
> URL: https://issues.apache.org/jira/browse/SPARK-19808
> Project: Spark
>  Issue Type: Question
>  Components: ML, Spark Core
>Affects Versions: 2.1.0
>Reporter: zhengruifeng
>Priority: Minor
>
> Now, {{unpersist}} is commonly used with its default value in ML.
> Most algorithms like {{KMeans}} use {{RDD.unpersist}}, whose default 
> {{blocking}} is {{true}}.
> Meta algorithms like {{OneVsRest}} and {{CrossValidator}} use 
> {{Dataset.unpersist}}, whose default {{blocking}} is {{false}}.
> Should the default values for {{RDD.unpersist}} and {{Dataset.unpersist}} be 
> consistent?
> And should all the {{blocking}} args in ML be set to {{false}}?
> [~srowen] [~mlnick] [~yanboliang]
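Until the defaults are reconciled, callers can pass the flag explicitly. A minimal PySpark sketch, assuming an active SparkSession named {{spark}}:

{code}
# Pass 'blocking' explicitly instead of relying on the differing defaults
# discussed above.
df = spark.range(1000).cache()
df.count()                   # materialize the cached data
df.unpersist(blocking=True)  # wait until the cached blocks are actually freed
{code}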



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19808) About the default blocking arg in unpersist

2017-03-08 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng closed SPARK-19808.

Resolution: Not A Problem

> About the default blocking arg in unpersist
> ---
>
> Key: SPARK-19808
> URL: https://issues.apache.org/jira/browse/SPARK-19808
> Project: Spark
>  Issue Type: Question
>  Components: ML, Spark Core
>Affects Versions: 2.1.0
>Reporter: zhengruifeng
>Priority: Minor
>
> Now, {{unpersist}} is commonly used with its default value in ML.
> Most algorithms like {{KMeans}} use {{RDD.unpersist}}, whose default 
> {{blocking}} is {{true}}.
> Meta algorithms like {{OneVsRest}} and {{CrossValidator}} use 
> {{Dataset.unpersist}}, whose default {{blocking}} is {{false}}.
> Should the default values for {{RDD.unpersist}} and {{Dataset.unpersist}} be 
> consistent?
> And should all the {{blocking}} args in ML be set to {{false}}?
> [~srowen] [~mlnick] [~yanboliang]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function

2017-03-08 Thread chenerlu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901132#comment-15901132
 ] 

chenerlu edited comment on SPARK-16283 at 3/9/17 1:55 AM:
--

Hi, I am a little confused about percentile_approx. Is it different from Hive's 
now? Will we get a different result when the input is the same?

For example, I run select percentile_approx(c4_double,array(0.1,0.2,0.3,0.4)) 
from test; and get different results.

c4_double is shown below:
1.0001
2.0001
3.0001
4.0001
5.0001
6.0001
7.0001
8.0001
9.0001
NULL
-8.952
-96.0

Hive:
[-87.2952,-6.9615799,1.30009998,2.40010003]

spark 2.x:
[-8.952,1.0001,2.0001,3.0001]

So which result is right? Could you please reply when you are free.

[~rxin] [~lwlin]
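For reference, the comparison above can be reproduced with a small PySpark sketch (assuming an active SparkSession named {{spark}}; the values are the ones listed in this comment):

{code}
vals = [1.0001, 2.0001, 3.0001, 4.0001, 5.0001, 6.0001,
        7.0001, 8.0001, 9.0001, None, -8.952, -96.0]
spark.createDataFrame([(v,) for v in vals], ['c4_double']) \
     .createOrReplaceTempView('test')
spark.sql("SELECT percentile_approx(c4_double, array(0.1, 0.2, 0.3, 0.4)) "
          "FROM test").show(truncate=False)
{code}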


was (Author: erlu):
Hi, I am a little confused about percentile_approx. Is it different from Hive's 
now? Will we get a different result when the input is the same?

For example, I run select percentile_approx(c4_double,array(0.1,0.2,0.3,0.4)) 
from test; and get different results.

c4_double is shown below:
1.0001
2.0001
3.0001
4.0001
5.0001
6.0001
7.0001
8.0001
9.0001
NULL
-8.952
-96.0

Hive:
[-87.2952,-6.9615799,1.30009998,2.40010003]

spark 2.x:
[-8.952,1.0001,2.0001,3.0001]

So which result is right? Could you please reply when you are free.

[~rxin] [~linwei]

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sean Zhong
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function

2017-03-08 Thread chenerlu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901132#comment-15901132
 ] 

chenerlu edited comment on SPARK-16283 at 3/9/17 1:55 AM:
--

Hi, I am a little confused about percentile_approx. Is it different from Hive's 
now? Will we get a different result when the input is the same?

For example, I run select percentile_approx(c4_double,array(0.1,0.2,0.3,0.4)) 
from test; and get different results.

c4_double is shown below:
1.0001
2.0001
3.0001
4.0001
5.0001
6.0001
7.0001
8.0001
9.0001
NULL
-8.952
-96.0

Hive:
[-87.2952,-6.9615799,1.30009998,2.40010003]

spark 2.x:
[-8.952,1.0001,2.0001,3.0001]

So which result is right? Could you please reply when you are free.

[~rxin] [~linwei]


was (Author: erlu):
Hi, I am a little confused about percentile_approx. Is it different from Hive's 
now? Will we get a different result when the input is the same?

For example, I run select percentile_approx(c4_double,array(0.1,0.2,0.3,0.4)) 
from test; and get different results.

c4_double is shown below:
1.0001
2.0001
3.0001
4.0001
5.0001
6.0001
7.0001
8.0001
9.0001
NULL
-8.952
-96.0

Hive:
[-87.2952,-6.9615799,1.30009998,2.40010003]

spark 2.x:
[-8.952,1.0001,2.0001,3.0001]

So which result is right? Could you please reply when you are free.



> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sean Zhong
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19862) In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted.

2017-03-08 Thread guoxiaolong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902263#comment-15902263
 ] 

guoxiaolong commented on SPARK-19862:
-

Remove "tungsten-sort". Because it no longer represents 
'org.apache.spark.shuffle.unsafe.UnsafeShuffleManager', it should be deleted.

> In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted. 
> -
>
> Key: SPARK-19862
> URL: https://issues.apache.org/jira/browse/SPARK-19862
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: guoxiaolong
>Priority: Trivial
>
> "tungsten-sort" -> 
> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName can be 
> deleted. Because it is the same of "sort" -> 
> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName.
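For context, a sketch of where these short names are used from the application side (a hypothetical standalone setup, for illustration only):

{code}
from pyspark import SparkConf, SparkContext

# SparkEnv resolves this value through shortShuffleMgrNames, and "sort" and
# "tungsten-sort" currently map to the same SortShuffleManager class.
conf = (SparkConf()
        .setAppName("shuffle-manager-demo")
        .set("spark.shuffle.manager", "sort"))   # behaves the same as "tungsten-sort"
sc = SparkContext(conf=conf)
{code}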



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19507) pyspark.sql.types._verify_type() exceptions too broad to debug collections or nested data

2017-03-08 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19507.
--
Resolution: Duplicate

Actually, it seems a duplicate of SPARK-19871. Let me resolve this one as that 
one already has a PR. Please reopen this if I am mistaken.

> pyspark.sql.types._verify_type() exceptions too broad to debug collections or 
> nested data
> -
>
> Key: SPARK-19507
> URL: https://issues.apache.org/jira/browse/SPARK-19507
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: macOS Sierra 10.12.3
> Spark 2.1.0, installed via Homebrew
>Reporter: David Gingrich
>Priority: Trivial
>
> The private function pyspark.sql.types._verify_type() recursively checks an 
> object against a datatype, raising an exception if the object does not 
> satisfy the type.  These messages are not specific enough to debug a data 
> error in a collection or nested data, for instance:
> {quote}
> >>> import pyspark.sql.types as typ
> >>> schema = typ.StructType([typ.StructField('nest1', 
> >>> typ.MapType(typ.StringType(), typ.ArrayType(typ.FloatType())))])
> >>> typ._verify_type({'nest1': {'nest2': [1]}}, schema)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1355, in 
> _verify_type
> _verify_type(obj.get(f.name), f.dataType, f.nullable, name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1349, in 
> _verify_type
> _verify_type(v, dataType.valueType, dataType.valueContainsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1342, in 
> _verify_type
> _verify_type(i, dataType.elementType, dataType.containsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1325, in 
> _verify_type
> % (name, dataType, obj, type(obj)))
> TypeError: FloatType can not accept object 1 in type 
> {quote}
> Passing and printing a field name would make debugging easier.
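A hypothetical wrapper sketching that suggestion (field and schema names come from the example above; {{_verify_type}} is a private PySpark API, so this is illustration only, not the proposed patch):

{code}
import pyspark.sql.types as typ

def verify_with_field_names(obj, schema):
    # Check each top-level field separately so the error names the field.
    for field in schema.fields:
        try:
            typ._verify_type(obj.get(field.name), field.dataType, field.nullable)
        except TypeError as e:
            raise TypeError("field '%s': %s" % (field.name, e))

schema = typ.StructType([typ.StructField(
    'nest1', typ.MapType(typ.StringType(), typ.ArrayType(typ.FloatType())))])
verify_with_field_names({'nest1': {'nest2': [1]}}, schema)  # error now names 'nest1'
{code}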



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19876) Add OneTime trigger executor

2017-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19876:


Assignee: (was: Apache Spark)

> Add OneTime trigger executor
> 
>
> Key: SPARK-19876
> URL: https://issues.apache.org/jira/browse/SPARK-19876
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
> Fix For: 2.2.0
>
>
> The goal is to add a new trigger executor that will process a single trigger 
> then stop. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19876) Add OneTime trigger executor

2017-03-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902221#comment-15902221
 ] 

Apache Spark commented on SPARK-19876:
--

User 'tcondie' has created a pull request for this issue:
https://github.com/apache/spark/pull/17219

> Add OneTime trigger executor
> 
>
> Key: SPARK-19876
> URL: https://issues.apache.org/jira/browse/SPARK-19876
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
> Fix For: 2.2.0
>
>
> The goal is to add a new trigger executor that will process a single trigger 
> then stop. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19876) Add OneTime trigger executor

2017-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19876:


Assignee: Apache Spark

> Add OneTime trigger executor
> 
>
> Key: SPARK-19876
> URL: https://issues.apache.org/jira/browse/SPARK-19876
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>Assignee: Apache Spark
> Fix For: 2.2.0
>
>
> The goal is to add a new trigger executor that will process a single trigger 
> then stop. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition

2017-03-08 Thread Brian Bruggeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Bruggeman updated SPARK-19872:

Priority: Blocker  (was: Major)

> UnicodeDecodeError in Pyspark on sc.textFile read with repartition
> --
>
> Key: SPARK-19872
> URL: https://issues.apache.org/jira/browse/SPARK-19872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Mac and EC2
>Reporter: Brian Bruggeman
>Priority: Blocker
>
> I'm receiving the following traceback:
> {code}
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> I created a text file (test.txt) with standard Linux newlines:
> {code}
> a
> b
> d
> e
> f
> g
> h
> i
> j
> k
> l
> {code}
> I then ran pyspark:
> {code}
> $ pyspark
> Python 2.7.13 (default, Dec 18 2016, 07:03:39)
> [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>       /_/
> Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
> SparkSession available as 'spark'.
> >>> sc.textFile('test.txt').collect()
> [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
> >>> sc.textFile('test.txt', use_unicode=False).collect()
> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
> >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
> ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', 
> '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> This really looks like a bug in the `serializers.py` code.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19876) Add OneTime trigger executor

2017-03-08 Thread Tyson Condie (JIRA)
Tyson Condie created SPARK-19876:


 Summary: Add OneTime trigger executor
 Key: SPARK-19876
 URL: https://issues.apache.org/jira/browse/SPARK-19876
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Tyson Condie
 Fix For: 2.2.0


The goal is to add a new trigger executor that will process a single trigger 
then stop. 
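A rough sketch of the kind of query this enables; the paths are hypothetical and the {{trigger(once=True)}} surface is an assumption about how the new executor would be exposed in PySpark:

{code}
query = (spark.readStream.text("/data/incoming")
         .writeStream
         .format("parquet")
         .option("path", "/data/output")
         .option("checkpointLocation", "/data/checkpoint")
         .trigger(once=True)   # assumed one-time trigger option
         .start())
query.awaitTermination()       # processes a single trigger, then the query stops
{code}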



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19281) spark.ml Python API for FPGrowth

2017-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19281:


Assignee: Apache Spark

> spark.ml Python API for FPGrowth
> 
>
> Key: SPARK-19281
> URL: https://issues.apache.org/jira/browse/SPARK-19281
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> See parent issue.  This is for a Python API *after* the Scala API has been 
> designed and implemented.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19281) spark.ml Python API for FPGrowth

2017-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19281:


Assignee: (was: Apache Spark)

> spark.ml Python API for FPGrowth
> 
>
> Key: SPARK-19281
> URL: https://issues.apache.org/jira/browse/SPARK-19281
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> See parent issue.  This is for a Python API *after* the Scala API has been 
> designed and implemented.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19281) spark.ml Python API for FPGrowth

2017-03-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902198#comment-15902198
 ] 

Apache Spark commented on SPARK-19281:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/17218

> spark.ml Python API for FPGrowth
> 
>
> Key: SPARK-19281
> URL: https://issues.apache.org/jira/browse/SPARK-19281
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> See parent issue.  This is for a Python API *after* the Scala API has been 
> designed and implemented.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19875) Map->filter on many columns gets stuck in constraint inference optimization code

2017-03-08 Thread Jay Pranavamurthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Pranavamurthi updated SPARK-19875:
--
Description: 
The attached code (TestFilter.scala) works with a 10-column csv dataset, but 
gets stuck with a 50-column csv dataset. Both datasets are attached.


  was:
The attached code works with a 10-column csv dataset, but gets stuck with a 
50-column csv dataset. Both datasets are attached.



> Map->filter on many columns gets stuck in constraint inference optimization 
> code
> 
>
> Key: SPARK-19875
> URL: https://issues.apache.org/jira/browse/SPARK-19875
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jay Pranavamurthi
> Attachments: test10cols.csv, test50cols.csv, TestFilter.scala
>
>
> The attached code (TestFilter.scala) works with a 10-column csv dataset, but 
> gets stuck with a 50-column csv dataset. Both datasets are attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19875) Map->filter on many columns gets stuck in constraint inference optimization code

2017-03-08 Thread Jay Pranavamurthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Pranavamurthi updated SPARK-19875:
--
Attachment: TestFilter.scala
test50cols.csv
test10cols.csv

> Map->filter on many columns gets stuck in constraint inference optimization 
> code
> 
>
> Key: SPARK-19875
> URL: https://issues.apache.org/jira/browse/SPARK-19875
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jay Pranavamurthi
> Attachments: test10cols.csv, test50cols.csv, TestFilter.scala
>
>
> The attached code works with a 10-column csv dataset, but gets stuck with a 
> 50-column csv dataset. Both datasets are attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19875) Map->filter on many columns gets stuck in constraint inference optimization code

2017-03-08 Thread Jay Pranavamurthi (JIRA)
Jay Pranavamurthi created SPARK-19875:
-

 Summary: Map->filter on many columns gets stuck in constraint 
inference optimization code
 Key: SPARK-19875
 URL: https://issues.apache.org/jira/browse/SPARK-19875
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Jay Pranavamurthi


The attached code works with a 10-column csv dataset, but gets stuck with a 
50-column csv dataset. Both datasets are attached.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6936) SQLContext.sql() caused deadlock in multi-thread env

2017-03-08 Thread Henry Min (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902167#comment-15902167
 ] 

Henry Min commented on SPARK-6936:
--

This issue seems to have been fixed in version 1.5.0. The information here is 
extremely important. Can anyone provide the merge details and show me the 
merged source code? A similar issue still exists on the latest version.

> SQLContext.sql() caused deadlock in multi-thread env
> 
>
> Key: SPARK-6936
> URL: https://issues.apache.org/jira/browse/SPARK-6936
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: JDK 1.8.x, RedHat
> Linux version 2.6.32-431.23.3.el6.x86_64 
> (mockbu...@x86-027.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red 
> Hat 4.4.7-4) (GCC) ) #1 SMP Wed Jul 16 06:12:23 EDT 2014
>Reporter: Paul Wu
>Assignee: Michael Armbrust
>  Labels: deadlock, sql, threading
> Fix For: 1.5.0
>
>
> Doing (the same query) in more than one thread with SQLContext.sql may lead 
> to deadlock. Here is a way to reproduce it (since this is a multi-thread issue, 
> the reproduction may or may not be easy).
> 1. Register a relatively big table.
> 2. Create two different classes and, in each class, do the same query in a 
> method, put the results in a set, and print out the set size.
> 3. Create two threads that each use an object from one of the classes in its 
> run method. Start the threads. In my tests, a deadlock can occur after just a few runs. 
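A rough PySpark sketch of these steps (the table name is hypothetical and an existing SQLContext named {{sqlContext}} with a large table already registered is assumed; the reported deadlock is on the JVM side, so this only mirrors the shape of the reproduction):

{code}
import threading

def run_query():
    rows = sqlContext.sql("SELECT * FROM big_table").collect()
    print(len(set(rows)))   # put the results in a set and print its size

threads = [threading.Thread(target=run_query) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
{code}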



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19540) Add ability to clone SparkSession with an identical copy of the SessionState

2017-03-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-19540:


Assignee: Kunal Khamar

> Add ability to clone SparkSession with an identical copy of the SessionState
> 
>
> Key: SPARK-19540
> URL: https://issues.apache.org/jira/browse/SPARK-19540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>Assignee: Kunal Khamar
> Fix For: 2.2.0
>
>
> Forking a newSession() from SparkSession currently makes a new SparkSession 
> that does not retain SessionState (i.e. temporary tables, SQL config, 
> registered functions, etc.). This change adds a method cloneSession() which 
> creates a new SparkSession with a copy of the parent's SessionState.
> Subsequent changes to the base session are not propagated to the cloned session; 
> the clone is independent after creation.
> If the base is changed after the clone has been created, say the user registers a 
> new UDF, then the new UDF will not be available inside the clone. The same goes 
> for configs and temp tables.
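For comparison, a minimal PySpark sketch of the existing newSession() behaviour that motivates cloneSession() (assumes an active SparkSession named {{spark}}):

{code}
spark.range(5).createOrReplaceTempView("t")
print([t.name for t in spark.catalog.listTables()])    # includes 't'

forked = spark.newSession()
print([t.name for t in forked.catalog.listTables()])   # 't' is absent: SessionState not retained
{code}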



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19540) Add ability to clone SparkSession with an identical copy of the SessionState

2017-03-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19540.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

> Add ability to clone SparkSession with an identical copy of the SessionState
> 
>
> Key: SPARK-19540
> URL: https://issues.apache.org/jira/browse/SPARK-19540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
> Fix For: 2.2.0
>
>
> Forking a newSession() from SparkSession currently makes a new SparkSession 
> that does not retain SessionState (i.e. temporary tables, SQL config, 
> registered functions, etc.). This change adds a method cloneSession() which 
> creates a new SparkSession with a copy of the parent's SessionState.
> Subsequent changes to the base session are not propagated to the cloned session; 
> the clone is independent after creation.
> If the base is changed after the clone has been created, say the user registers a 
> new UDF, then the new UDF will not be available inside the clone. The same goes 
> for configs and temp tables.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19874) Hide API docs for "org.apache.spark.sql.internal"

2017-03-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902161#comment-15902161
 ] 

Apache Spark commented on SPARK-19874:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/17217

> Hide API docs for "org.apache.spark.sql.internal"
> -
>
> Key: SPARK-19874
> URL: https://issues.apache.org/jira/browse/SPARK-19874
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> The API docs should not include the "org.apache.spark.sql.internal" package 
> because they are internal private APIs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19874) Hide API docs for "org.apache.spark.sql.internal"

2017-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19874:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Hide API docs for "org.apache.spark.sql.internal"
> -
>
> Key: SPARK-19874
> URL: https://issues.apache.org/jira/browse/SPARK-19874
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> The API docs should not include the "org.apache.spark.sql.internal" package 
> because they are internal private APIs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19874) Hide API docs for "org.apache.spark.sql.internal"

2017-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19874:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Hide API docs for "org.apache.spark.sql.internal"
> -
>
> Key: SPARK-19874
> URL: https://issues.apache.org/jira/browse/SPARK-19874
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> The API docs should not include the "org.apache.spark.sql.internal" package 
> because they are internal private APIs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19874) Hide API docs for "org.apache.spark.sql.internal"

2017-03-08 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-19874:


 Summary: Hide API docs for "org.apache.spark.sql.internal"
 Key: SPARK-19874
 URL: https://issues.apache.org/jira/browse/SPARK-19874
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.1.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


The API docs should not include the "org.apache.spark.sql.internal" package 
because they are internal private APIs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19873) If the user changes the number of shuffle partitions between batches, Streaming aggregation will fail.

2017-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19873:


Assignee: Apache Spark

> If the user changes the number of shuffle partitions between batches, 
> Streaming aggregation will fail.
> --
>
> Key: SPARK-19873
> URL: https://issues.apache.org/jira/browse/SPARK-19873
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Kunal Khamar
>Assignee: Apache Spark
>
> If the user changes the shuffle partition number between batches, Streaming 
> aggregation will fail.
> Here are some possible cases:
> - Change "spark.sql.shuffle.partitions"
> - Use "repartition" and change the partition number in code
> - RangePartitioner doesn't generate deterministic partitions. Right now it's 
> safe as we disallow sort before aggregation. Not sure if we will add some 
> operators using RangePartitioner in the future.
> Fix:
> Record the number of shuffle partitions in the offset log and enforce it in the next batch.
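A sketch of the failure scenario (paths are hypothetical; an active SparkSession named {{spark}} is assumed): a streaming aggregation is checkpointed under one shuffle-partition setting and later restarted under another.

{code}
spark.conf.set("spark.sql.shuffle.partitions", "200")
counts = spark.readStream.text("/data/stream-in").groupBy("value").count()
query = (counts.writeStream
         .outputMode("complete")
         .format("memory")
         .queryName("counts")
         .option("checkpointLocation", "/data/ckpt")
         .start())
query.stop()

# Restarting against the same checkpoint with a different setting is the case
# the recorded partition count is meant to catch.
spark.conf.set("spark.sql.shuffle.partitions", "50")
{code}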



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19873) If the user changes the number of shuffle partitions between batches, Streaming aggregation will fail.

2017-03-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902122#comment-15902122
 ] 

Apache Spark commented on SPARK-19873:
--

User 'kunalkhamar' has created a pull request for this issue:
https://github.com/apache/spark/pull/17216

> If the user changes the number of shuffle partitions between batches, 
> Streaming aggregation will fail.
> --
>
> Key: SPARK-19873
> URL: https://issues.apache.org/jira/browse/SPARK-19873
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Kunal Khamar
>
> If the user changes the shuffle partition number between batches, Streaming 
> aggregation will fail.
> Here are some possible cases:
> - Change "spark.sql.shuffle.partitions"
> - Use "repartition" and change the partition number in code
> - RangePartitioner doesn't generate deterministic partitions. Right now it's 
> safe as we disallow sort before aggregation. Not sure if we will add some 
> operators using RangePartitioner in the future.
> Fix:
> Record the number of shuffle partitions in the offset log and enforce it in the next batch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19873) If the user changes the number of shuffle partitions between batches, Streaming aggregation will fail.

2017-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19873:


Assignee: (was: Apache Spark)

> If the user changes the number of shuffle partitions between batches, 
> Streaming aggregation will fail.
> --
>
> Key: SPARK-19873
> URL: https://issues.apache.org/jira/browse/SPARK-19873
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Kunal Khamar
>
> If the user changes the shuffle partition number between batches, Streaming 
> aggregation will fail.
> Here are some possible cases:
> - Change "spark.sql.shuffle.partitions"
> - Use "repartition" and change the partition number in code
> - RangePartitioner doesn't generate deterministic partitions. Right now it's 
> safe as we disallow sort before aggregation. Not sure if we will add some 
> operators using RangePartitioner in the future.
> Fix:
> Record the number of shuffle partitions in the offset log and enforce it in the next batch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases

2017-03-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19858:
-
Affects Version/s: (was: 2.1.1)

> Add output mode to flatMapGroupsWithState and disallow invalid cases
> 
>
> Key: SPARK-19858
> URL: https://issues.apache.org/jira/browse/SPARK-19858
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases

2017-03-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19858.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

> Add output mode to flatMapGroupsWithState and disallow invalid cases
> 
>
> Key: SPARK-19858
> URL: https://issues.apache.org/jira/browse/SPARK-19858
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19413) Basic mapGroupsWithState API

2017-03-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19413:
-
Fix Version/s: (was: 2.1.1)

> Basic mapGroupsWithState API
> 
>
> Key: SPARK-19413
> URL: https://issues.apache.org/jira/browse/SPARK-19413
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.2.0
>
>
> Basic API (without timeouts) as described in the parent JIRA SPARK-19067



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19413) Basic mapGroupsWithState API

2017-03-08 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902105#comment-15902105
 ] 

Shixiong Zhu commented on SPARK-19413:
--

Reverted the patch from branch 2.1. This feature will not go into 2.1.1.

> Basic mapGroupsWithState API
> 
>
> Key: SPARK-19413
> URL: https://issues.apache.org/jira/browse/SPARK-19413
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.2.0
>
>
> Basic API (without timeouts) as described in the parent JIRA SPARK-19067



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19413) Basic mapGroupsWithState API

2017-03-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19413:
-
Target Version/s: 2.2.0  (was: 2.1.1, 2.2.0)

> Basic mapGroupsWithState API
> 
>
> Key: SPARK-19413
> URL: https://issues.apache.org/jira/browse/SPARK-19413
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.2.0
>
>
> Basic API (without timeouts) as described in the parent JIRA SPARK-19067



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19813) maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource

2017-03-08 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-19813.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> maxFilesPerTrigger combo latestFirst may miss old files in combination with 
> maxFileAge in FileStreamSource
> --
>
> Key: SPARK-19813
> URL: https://issues.apache.org/jira/browse/SPARK-19813
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.1.1, 2.2.0
>
>
> There is a file stream source option called maxFileAge which limits how old 
> the files can be, relative the latest file that has been seen. This is used 
> to limit the files that need to be remembered as "processed". Files older 
> than the latest processed files are ignored. This values is by default 7 days.
> This causes a problem when both 
>  - latestFirst = true
>  - maxFilesPerTrigger > total files to be processed.
> Here is what happens in all combinations:
>  1) latestFirst = false - Since files are processed in order, there won't be 
> any unprocessed file older than the latest processed file. All files will be 
> processed.
>  2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge 
> thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is 
> not set, then all old files get processed in the first batch, and so no file is 
> left behind.
>  3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch 
> processes the latest X files. That sets the threshold to (latest file - 
> maxFileAge), so files older than this threshold will never be considered for 
> processing. The bug is with case 3.
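
For reference, the problematic combination in case 3 can be set up roughly as 
follows. This is only a sketch: the input path, checkpoint path and file counts 
are illustrative, not taken from the report.

{code}
// Suppose /tmp/input-dir already holds 100 files. With the options below, the
// first trigger reads only the 10 newest files, which fixes the maxFileAge
// threshold relative to them; the remaining 90 older files are then skipped.
val lines = spark.readStream
  .option("latestFirst", "true")
  .option("maxFilesPerTrigger", "10")
  .option("maxFileAge", "7d")
  .text("/tmp/input-dir")

lines.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/file-source-chk")
  .start()
{code}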






[jira] [Comment Edited] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition

2017-03-08 Thread Brian Bruggeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902023#comment-15902023
 ] 

Brian Bruggeman edited comment on SPARK-19872 at 3/8/17 10:33 PM:
--

Using the Spark 2.1.0 serializers.py and the Spark 2.0.2 rdd.py, the code runs.


was (Author: bbruggeman):
Using the Spark 2.1.0 serializers.py and the Spark 2.0.2 rdd.py, the code runs. 
 Therefore there must be a bug in the Spark 2.1.0 rdd.py file.


I attempted the naive approach of reverting both {code}def 
_load_from_socket{code} and {code}def collect{code}, but I didn't see any 
appreciable difference within the output.

> UnicodeDecodeError in Pyspark on sc.textFile read with repartition
> --
>
> Key: SPARK-19872
> URL: https://issues.apache.org/jira/browse/SPARK-19872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Mac and EC2
>Reporter: Brian Bruggeman
>
> I'm receiving the following traceback:
> {code}
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> I created a text file (test.txt) with standard Linux newlines:
> {code}
> a
> b
> d
> e
> f
> g
> h
> i
> j
> k
> l
> {code}
> I then ran pyspark:
> {code}
> $ pyspark
> Python 2.7.13 (default, Dec 18 2016, 07:03:39)
> [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>   /_/
> Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
> SparkSession available as 'spark'.
> >>> sc.textFile('test.txt').collect()
> [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
> >>> sc.textFile('test.txt', use_unicode=False).collect()
> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
> >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
> ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', 
> '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> This really looks like a bug in the `serializers.py` code.




[jira] [Updated] (SPARK-19540) Add ability to clone SparkSession with an identical copy of the SessionState

2017-03-08 Thread Kunal Khamar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khamar updated SPARK-19540:
-
Summary: Add ability to clone SparkSession with an identical copy of the 
SessionState  (was: Add ability to clone SparkSession wherein cloned session 
has an identical copy of the SessionState)

> Add ability to clone SparkSession with an identical copy of the SessionState
> 
>
> Key: SPARK-19540
> URL: https://issues.apache.org/jira/browse/SPARK-19540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>
> Forking a newSession() from SparkSession currently makes a new SparkSession 
> that does not retain SessionState (i.e. temporary tables, SQL config, 
> registered functions, etc.). This change adds a method cloneSession() which 
> creates a new SparkSession with a copy of the parent's SessionState.
> Subsequent changes to the base session are not propagated to the cloned 
> session; the clone is independent after creation.
> If the base is changed after the clone has been created, say the user 
> registers a new UDF, then the new UDF will not be available inside the clone. 
> The same goes for configs and temp tables.
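
As a sketch of the intended behaviour (cloneSession() is the method named above, 
but its exact signature and visibility are assumptions here, not confirmed by 
this issue):

{code}
spark.sql("CREATE TEMPORARY VIEW t AS SELECT 1 AS id")

val fresh  = spark.newSession()    // existing behaviour: temp views and SQL conf not carried over
val cloned = spark.cloneSession()  // proposed: starts from a copy of the parent's SessionState

// cloned.sql("SELECT * FROM t") would see the temp view, while fresh would not.
// A UDF or view registered on `spark` after this point is visible to neither,
// since the clone is independent after creation.
{code}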






[jira] [Updated] (SPARK-19540) Add ability to clone SparkSession wherein cloned session has an identical copy of the SessionState

2017-03-08 Thread Kunal Khamar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khamar updated SPARK-19540:
-
Summary: Add ability to clone SparkSession wherein cloned session has an 
identical copy of the SessionState  (was: Add ability to clone SparkSession 
wherein cloned session has a reference to SharedState and an identical copy of 
the SessionState)

> Add ability to clone SparkSession wherein cloned session has an identical 
> copy of the SessionState
> --
>
> Key: SPARK-19540
> URL: https://issues.apache.org/jira/browse/SPARK-19540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>
> Forking a newSession() from SparkSession currently makes a new SparkSession 
> that does not retain SessionState (i.e. temporary tables, SQL config, 
> registered functions, etc.). This change adds a method cloneSession() which 
> creates a new SparkSession with a copy of the parent's SessionState.
> Subsequent changes to the base session are not propagated to the cloned 
> session; the clone is independent after creation.
> If the base is changed after the clone has been created, say the user 
> registers a new UDF, then the new UDF will not be available inside the clone. 
> The same goes for configs and temp tables.






[jira] [Closed] (SPARK-19814) Spark History Server Out Of Memory / Extreme GC

2017-03-08 Thread Simon King (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon King closed SPARK-19814.
--
Resolution: Duplicate

Looks like it's wrong to characterize this as a bug -- couldn't identify an 
actual memory leak. Seems more like we'll have to wait for the major overhaul 
proposed by https://issues.apache.org/jira/browse/SPARK-18085 

> Spark History Server Out Of Memory / Extreme GC
> ---
>
> Key: SPARK-19814
> URL: https://issues.apache.org/jira/browse/SPARK-19814
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.1, 2.0.0, 2.1.0
> Environment: Spark History Server (we've run it on several different 
> Hadoop distributions)
>Reporter: Simon King
> Attachments: SparkHistoryCPUandRAM.png
>
>
> Spark History Server runs out of memory, gets into GC thrash and eventually 
> becomes unresponsive. This seems to happen more quickly with heavy use of the 
> REST API. We've seen this with several versions of Spark. 
> Running with the following settings (spark 2.1):
> spark.history.fs.cleaner.enabled    true
> spark.history.fs.cleaner.interval   1d
> spark.history.fs.cleaner.maxAge 7d
> spark.history.retainedApplications  500
> We will eventually get errors like:
> 17/02/25 05:02:19 WARN ServletHandler:
> javax.servlet.ServletException: scala.MatchError: java.lang.OutOfMemoryError: 
> GC overhead limit exceeded (of class java.lang.OutOfMemoryError)
>   at 
> org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489)
>   at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:529)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>   at org.spark_project.jetty.server.Server.handle(Server.java:499)
>   at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>   at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit 
> exceeded (of class java.lang.OutOfMemoryError)
>   at 
> org.apache.spark.deploy.history.ApplicationCache.getSparkUI(ApplicationCache.scala:148)
>   at 
> org.apache.spark.deploy.history.HistoryServer.getSparkUI(HistoryServer.scala:110)
>   at 
> org.apache.spark.status.api.v1.UIRoot$class.withSparkUI(ApiRootResource.scala:244)
>   at 
> org.apache.spark.deploy.history.HistoryServer.withSparkUI(HistoryServer.scala:49)
>   at 
> org.apache.spark.status.api.v1.ApiRootResource.getJobs(ApiRootResource.scala:66)
>   at sun.reflect.GeneratedMethodAccessor102.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter$1.run(SubResourceLocatorRouter.java:158)
>   at 
> org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter.getResource(SubResourceLocatorRouter.java:178)
>   at 
> org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter.apply(SubResourceLocatorRouter.java:109)
>   at 
> org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:109)
>   at 
> org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:112)
>   at 
> org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:112)
>   at 
> 

[jira] [Comment Edited] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition

2017-03-08 Thread Brian Bruggeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902023#comment-15902023
 ] 

Brian Bruggeman edited comment on SPARK-19872 at 3/8/17 9:48 PM:
-

Using the Spark 2.1.0 serializers.py and the Spark 2.0.2 rdd.py, the code runs. 
 Therefore there must be a bug in the Spark 2.1.0 rdd.py file.


I attempted the naive approach of reverting both {code}def 
_load_from_socket{code} and {code}def collect{code}, but I didn't see any 
appreciable difference within the output.


was (Author: bbruggeman):
Using the Spark 2.1.0 serializers.py and the Spark 2.0.2 rdd.py, the code runs. 
 Therefore there must be a bug in the Spark 2.1.0 rdd.py file.


> UnicodeDecodeError in Pyspark on sc.textFile read with repartition
> --
>
> Key: SPARK-19872
> URL: https://issues.apache.org/jira/browse/SPARK-19872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Mac and EC2
>Reporter: Brian Bruggeman
>
> I'm receiving the following traceback:
> {code}
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> I created a text file (test.txt) with standard Linux newlines:
> {code}
> a
> b
> d
> e
> f
> g
> h
> i
> j
> k
> l
> {code}
> I then ran pyspark:
> {code}
> $ pyspark
> Python 2.7.13 (default, Dec 18 2016, 07:03:39)
> [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>   /_/
> Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
> SparkSession available as 'spark'.
> >>> sc.textFile('test.txt').collect()
> [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
> >>> sc.textFile('test.txt', use_unicode=False).collect()
> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
> >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
> ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', 
> '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> This really looks like a bug in the `serializers.py` code.





[jira] [Comment Edited] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition

2017-03-08 Thread Brian Bruggeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902023#comment-15902023
 ] 

Brian Bruggeman edited comment on SPARK-19872 at 3/8/17 9:46 PM:
-

Using the Spark 2.1.0 serializers.py and the Spark 2.0.2 rdd.py, the code runs. 
 Therefore there must be a bug in the Spark 2.1.0 rdd.py file.



was (Author: bbruggeman):
Using the Spark 2.1.0 serializers.py and the Spark 2.0.2 rdd.py, the code runs. 
 Therefore there must be a bug in the rdd.py file somewhere.


> UnicodeDecodeError in Pyspark on sc.textFile read with repartition
> --
>
> Key: SPARK-19872
> URL: https://issues.apache.org/jira/browse/SPARK-19872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Mac and EC2
>Reporter: Brian Bruggeman
>
> I'm receiving the following traceback:
> {code}
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> I created a text file (test.txt) with standard Linux newlines:
> {code}
> a
> b
> d
> e
> f
> g
> h
> i
> j
> k
> l
> {code}
> I then ran pyspark:
> {code}
> $ pyspark
> Python 2.7.13 (default, Dec 18 2016, 07:03:39)
> [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>   /_/
> Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
> SparkSession available as 'spark'.
> >>> sc.textFile('test.txt').collect()
> [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
> >>> sc.textFile('test.txt', use_unicode=False).collect()
> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
> >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
> ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', 
> '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> This really looks like a bug in the `serializers.py` code.






[jira] [Commented] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition

2017-03-08 Thread Brian Bruggeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902023#comment-15902023
 ] 

Brian Bruggeman commented on SPARK-19872:
-

Using the Spark 2.1.0 serializers.py and the Spark 2.0.0 rdd.py, the code runs. 
 Therefore there must be a bug in the rdd.py file somewhere.


> UnicodeDecodeError in Pyspark on sc.textFile read with repartition
> --
>
> Key: SPARK-19872
> URL: https://issues.apache.org/jira/browse/SPARK-19872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Mac and EC2
>Reporter: Brian Bruggeman
>
> I'm receiving the following traceback:
> {code}
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> I created a text file (test.txt) with standard Linux newlines:
> {code}
> a
> b
> d
> e
> f
> g
> h
> i
> j
> k
> l
> {code}
> I then ran pyspark:
> {code}
> $ pyspark
> Python 2.7.13 (default, Dec 18 2016, 07:03:39)
> [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>   /_/
> Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
> SparkSession available as 'spark'.
> >>> sc.textFile('test.txt').collect()
> [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
> >>> sc.textFile('test.txt', use_unicode=False).collect()
> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
> >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
> ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', 
> '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> This really looks like a bug in the `serializers.py` code.






[jira] [Comment Edited] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition

2017-03-08 Thread Brian Bruggeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902023#comment-15902023
 ] 

Brian Bruggeman edited comment on SPARK-19872 at 3/8/17 9:46 PM:
-

Using the Spark 2.1.0 serializers.py and the Spark 2.0.2 rdd.py, the code runs. 
 Therefore there must be a bug in the rdd.py file somewhere.



was (Author: bbruggeman):
Using the Spark 2.1.0 serializers.py and the Spark 2.0.0 rdd.py, the code runs. 
 Therefore there must be a bug in the rdd.py file somewhere.


> UnicodeDecodeError in Pyspark on sc.textFile read with repartition
> --
>
> Key: SPARK-19872
> URL: https://issues.apache.org/jira/browse/SPARK-19872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Mac and EC2
>Reporter: Brian Bruggeman
>
> I'm receiving the following traceback:
> {code}
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> I created a text file (test.txt) with standard Linux newlines:
> {code}
> a
> b
> d
> e
> f
> g
> h
> i
> j
> k
> l
> {code}
> I then ran pyspark:
> {code}
> $ pyspark
> Python 2.7.13 (default, Dec 18 2016, 07:03:39)
> [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>   /_/
> Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
> SparkSession available as 'spark'.
> >>> sc.textFile('test.txt').collect()
> [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
> >>> sc.textFile('test.txt', use_unicode=False).collect()
> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
> >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
> ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', 
> '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> This really looks like a bug in the `serializers.py` code.






[jira] [Assigned] (SPARK-15463) Support for creating a dataframe from CSV in Dataset[String]

2017-03-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-15463:
---

Assignee: Hyukjin Kwon

> Support for creating a dataframe from CSV in Dataset[String]
> 
>
> Key: SPARK-15463
> URL: https://issues.apache.org/jira/browse/SPARK-15463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: PJ Fanning
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> I currently use Databricks' spark-csv lib but some features don't work with 
> Apache Spark 2.0.0-SNAPSHOT. I understand that with the addition of CSV 
> support into spark-sql directly, spark-csv won't be modified.
> I currently read some CSV data that has been pre-processed and is in 
> RDD[String] format.
> There is sqlContext.read.json(rdd: RDD[String]), but other formats don't 
> appear to support the creation of DataFrames based on loading from 
> RDD[String].
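
The resolution (Spark 2.2.0) adds a CSV reader that takes a Dataset[String] 
directly; a minimal sketch with made-up sample data, assuming that overload:

{code}
import spark.implicits._

// Pre-processed CSV lines, as in the use case described above.
val csvLines = Seq("id,name", "1,alice", "2,bob").toDS()

val df = spark.read
  .option("header", "true")
  .csv(csvLines)   // assumed 2.2.0 overload: csv(csvDataset: Dataset[String])

df.show()
{code}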






[jira] [Resolved] (SPARK-15463) Support for creating a dataframe from CSV in Dataset[String]

2017-03-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-15463.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16854
[https://github.com/apache/spark/pull/16854]

> Support for creating a dataframe from CSV in Dataset[String]
> 
>
> Key: SPARK-15463
> URL: https://issues.apache.org/jira/browse/SPARK-15463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: PJ Fanning
> Fix For: 2.2.0
>
>
> I currently use Databricks' spark-csv lib but some features don't work with 
> Apache Spark 2.0.0-SNAPSHOT. I understand that with the addition of CSV 
> support into spark-sql directly, spark-csv won't be modified.
> I currently read some CSV data that has been pre-processed and is in 
> RDD[String] format.
> There is sqlContext.read.json(rdd: RDD[String]), but other formats don't 
> appear to support the creation of DataFrames based on loading from 
> RDD[String].






[jira] [Updated] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases

2017-03-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19858:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-19067

> Add output mode to flatMapGroupsWithState and disallow invalid cases
> 
>
> Key: SPARK-19858
> URL: https://issues.apache.org/jira/browse/SPARK-19858
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>







[jira] [Updated] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases

2017-03-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19858:
-
Affects Version/s: 2.1.1

> Add output mode to flatMapGroupsWithState and disallow invalid cases
> 
>
> Key: SPARK-19858
> URL: https://issues.apache.org/jira/browse/SPARK-19858
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>







[jira] [Commented] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition

2017-03-08 Thread Brian Bruggeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901997#comment-15901997
 ] 

Brian Bruggeman commented on SPARK-19872:
-

I reverted `rdd.py` and `serializers.py` to the 2.0.2 branch on GitHub and the 
code above works without an error.

rdd.py link: 
https://github.com/apache/spark/blob/branch-2.0/python/pyspark/rdd.py
serializers.py link: 
https://github.com/apache/spark/blob/branch-2.0/python/pyspark/serializers.py

rdd diff:
{code}
--- tmp2.py 2017-03-08 15:17:45.0 -0600
+++ saved2.py   2017-03-08 15:17:59.0 -0600
@@ -52,8 +52,6 @@
 get_used_memory, ExternalSorter, ExternalGroupBy
 from pyspark.traceback_utils import SCCallSiteSync

-from py4j.java_collections import ListConverter, MapConverter
-

 __all__ = ["RDD"]

@@ -137,11 +135,12 @@
 break
 if not sock:
 raise Exception("could not open socket")
-# The RDD materialization time is unpredicable, if we set a timeout for 
socket reading
-# operation, it will very possibly fail. See SPARK-18281.
-sock.settimeout(None)
-# The socket will be automatically closed when garbage-collected.
-return serializer.load_stream(sock.makefile("rb", 65536))
+try:
+rf = sock.makefile("rb", 65536)
+for item in serializer.load_stream(rf):
+yield item
+finally:
+sock.close()


 def ignore_unicode_prefix(f):
@@ -264,13 +263,44 @@

 def isCheckpointed(self):
 """
-Return whether this RDD has been checkpointed or not
+Return whether this RDD is checkpointed and materialized, either 
reliably or locally.
 """
 return self._jrdd.rdd().isCheckpointed()

+def localCheckpoint(self):
+"""
+Mark this RDD for local checkpointing using Spark's existing caching 
layer.
+
+This method is for users who wish to truncate RDD lineages while 
skipping the expensive
+step of replicating the materialized data in a reliable distributed 
file system. This is
+useful for RDDs with long lineages that need to be truncated 
periodically (e.g. GraphX).
+
+Local checkpointing sacrifices fault-tolerance for performance. In 
particular, checkpointed
+data is written to ephemeral local storage in the executors instead of 
to a reliable,
+fault-tolerant storage. The effect is that if an executor fails during 
the computation,
+the checkpointed data may no longer be accessible, causing an 
irrecoverable job failure.
+
+This is NOT safe to use with dynamic allocation, which removes 
executors along
+with their cached blocks. If you must use both features, you are 
advised to set
+L{spark.dynamicAllocation.cachedExecutorIdleTimeout} to a high value.
+
+The checkpoint directory set through 
L{SparkContext.setCheckpointDir()} is not used.
+"""
+self._jrdd.rdd().localCheckpoint()
+
+def isLocallyCheckpointed(self):
+"""
+Return whether this RDD is marked for local checkpointing.
+
+Exposed for testing.
+"""
+return self._jrdd.rdd().isLocallyCheckpointed()
+
 def getCheckpointFile(self):
 """
 Gets the name of the file to which this RDD was checkpointed
+
+Not defined if RDD is checkpointed locally.
 """
 checkpointFile = self._jrdd.rdd().getCheckpointFile()
 if checkpointFile.isDefined():
@@ -387,6 +417,9 @@
 with replacement: expected number of times each element is chosen; 
fraction must be >= 0
 :param seed: seed for the random number generator

+.. note:: This is not guaranteed to provide exactly the fraction 
specified of the total
+count of the given :class:`DataFrame`.
+
 >>> rdd = sc.parallelize(range(100), 4)
 >>> 6 <= rdd.sample(False, 0.1, 81).count() <= 14
 True
@@ -425,8 +458,8 @@
 """
 Return a fixed-size sampled subset of this RDD.

-Note that this method should only be used if the resulting array is 
expected
-to be small, as all the data is loaded into the driver's memory.
+.. note:: This method should only be used if the resulting array is 
expected
+to be small, as all the data is loaded into the driver's memory.

 >>> rdd = sc.parallelize(range(0, 10))
 >>> len(rdd.takeSample(True, 20, 1))
@@ -537,7 +570,7 @@
 Return the intersection of this RDD and another one. The output will
 not contain any duplicate elements, even if the input RDDs did.

-Note that this method performs a shuffle internally.
+.. note:: This method performs a shuffle internally.

 >>> rdd1 = sc.parallelize([1, 10, 2, 3, 4, 5])
 >>> rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])
@@ -768,8 +801,9 @@
 def collect(self):
 """
 Return a list that 

[jira] [Updated] (SPARK-19873) If the user changes the number of shuffle partitions between batches, Streaming aggregation will fail.

2017-03-08 Thread Kunal Khamar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khamar updated SPARK-19873:
-
Summary: If the user changes the number of shuffle partitions between 
batches, Streaming aggregation will fail.  (was: If the user changes the 
shuffle partition number between batches, Streaming aggregation will fail.)

> If the user changes the number of shuffle partitions between batches, 
> Streaming aggregation will fail.
> --
>
> Key: SPARK-19873
> URL: https://issues.apache.org/jira/browse/SPARK-19873
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Kunal Khamar
>
> If the user changes the shuffle partition number between batches, Streaming 
> aggregation will fail.
> Here are some possible cases:
> - Change "spark.sql.shuffle.partitions"
> - Use "repartition" and change the partition number in code
> - RangePartitioner doesn't generate deterministic partitions. Right now it's 
> safe as we disallow sort before aggregation. Not sure if we will add some 
> operators using RangePartitioner in the future.
> Fix:
> Record the number of shuffle partitions in the offset log and enforce it in 
> the next batch
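
To make the first case concrete, a sketch of how the failure is hit; the source, 
paths and numbers are illustrative only.

{code}
import spark.implicits._

spark.conf.set("spark.sql.shuffle.partitions", "200")

val counts = spark.readStream
  .format("socket")
  .option("host", "localhost").option("port", "9999")
  .load()
  .groupBy($"value")
  .count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/tmp/agg-chk")   // placeholder path
  .start()

// Later: stop the query, set spark.sql.shuffle.partitions to a different value
// (say 50), and restart from the same checkpoint. The aggregation state restored
// from the checkpoint was laid out for 200 shuffle partitions and no longer
// matches the new plan.
{code}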






[jira] [Updated] (SPARK-19873) If the user changes the shuffle partition number between batches, Streaming aggregation will fail.

2017-03-08 Thread Kunal Khamar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khamar updated SPARK-19873:
-
Description: 
If the user changes the shuffle partition number between batches, Streaming 
aggregation will fail.

Here are some possible cases:

- Change "spark.sql.shuffle.partitions"
- Use "repartition" and change the partition number in codes
- RangePartitioner doesn't generate deterministic partitions. Right now it's 
safe as we disallow sort before aggregation. Not sure if we will add some 
operators using RangePartitioner in future.

Fix:
Record # shuffle partition in offset log and enforce in next batch

  was:
It the user changes the shuffle partition number between batches, Streaming 
aggregation will fail.

Here are some possible cases:

- Change "spark.sql.shuffle.partitions"
- Use "repartition" and change the partition number in codes
- RangePartitioner doesn't generate deterministic partitions. Right now it's 
safe as we disallow sort before aggregation. Not sure if we will add some 
operators using RangePartitioner in future.

Fix:
Record # shuffle partition in offset log and enforce in next batch


> If the user changes the shuffle partition number between batches, Streaming 
> aggregation will fail.
> --
>
> Key: SPARK-19873
> URL: https://issues.apache.org/jira/browse/SPARK-19873
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Kunal Khamar
>
> If the user changes the shuffle partition number between batches, Streaming 
> aggregation will fail.
> Here are some possible cases:
> - Change "spark.sql.shuffle.partitions"
> - Use "repartition" and change the partition number in codes
> - RangePartitioner doesn't generate deterministic partitions. Right now it's 
> safe as we disallow sort before aggregation. Not sure if we will add some 
> operators using RangePartitioner in future.
> Fix:
> Record # shuffle partition in offset log and enforce in next batch






[jira] [Updated] (SPARK-19873) If the user changes the shuffle partition number between batches, Streaming aggregation will fail.

2017-03-08 Thread Kunal Khamar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khamar updated SPARK-19873:
-
Description: 
If the user changes the shuffle partition number between batches, Streaming 
aggregation will fail.

Here are some possible cases:

- Change "spark.sql.shuffle.partitions"
- Use "repartition" and change the partition number in codes
- RangePartitioner doesn't generate deterministic partitions. Right now it's 
safe as we disallow sort before aggregation. Not sure if we will add some 
operators using RangePartitioner in future.

Fix:
Record # shuffle partitions in offset log and enforce in next batch

  was:
If the user changes the shuffle partition number between batches, Streaming 
aggregation will fail.

Here are some possible cases:

- Change "spark.sql.shuffle.partitions"
- Use "repartition" and change the partition number in codes
- RangePartitioner doesn't generate deterministic partitions. Right now it's 
safe as we disallow sort before aggregation. Not sure if we will add some 
operators using RangePartitioner in future.

Fix:
Record # shuffle partition in offset log and enforce in next batch


> If the user changes the shuffle partition number between batches, Streaming 
> aggregation will fail.
> --
>
> Key: SPARK-19873
> URL: https://issues.apache.org/jira/browse/SPARK-19873
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Kunal Khamar
>
> If the user changes the shuffle partition number between batches, Streaming 
> aggregation will fail.
> Here are some possible cases:
> - Change "spark.sql.shuffle.partitions"
> - Use "repartition" and change the partition number in codes
> - RangePartitioner doesn't generate deterministic partitions. Right now it's 
> safe as we disallow sort before aggregation. Not sure if we will add some 
> operators using RangePartitioner in future.
> Fix:
> Record # shuffle partitions in offset log and enforce in next batch






[jira] [Updated] (SPARK-19873) If the user changes the shuffle partition number between batches, Streaming aggregation will fail.

2017-03-08 Thread Kunal Khamar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khamar updated SPARK-19873:
-
Description: 
It the user changes the shuffle partition number between batches, Streaming 
aggregation will fail.

Here are some possible cases:

- Change "spark.sql.shuffle.partitions"
- Use "repartition" and change the partition number in codes
- RangePartitioner doesn't generate deterministic partitions. Right now it's 
safe as we disallow sort before aggregation. Not sure if we will add some 
operators using RangePartitioner in future.

Fix:
Record # shuffle partition in offset log and enforce in next batch

> If the user changes the shuffle partition number between batches, Streaming 
> aggregation will fail.
> --
>
> Key: SPARK-19873
> URL: https://issues.apache.org/jira/browse/SPARK-19873
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Kunal Khamar
>
> It the user changes the shuffle partition number between batches, Streaming 
> aggregation will fail.
> Here are some possible cases:
> - Change "spark.sql.shuffle.partitions"
> - Use "repartition" and change the partition number in codes
> - RangePartitioner doesn't generate deterministic partitions. Right now it's 
> safe as we disallow sort before aggregation. Not sure if we will add some 
> operators using RangePartitioner in future.
> Fix:
> Record # shuffle partition in offset log and enforce in next batch






[jira] [Created] (SPARK-19873) If the user changes the shuffle partition number between batches, Streaming aggregation will fail.

2017-03-08 Thread Kunal Khamar (JIRA)
Kunal Khamar created SPARK-19873:


 Summary: If the user changes the shuffle partition number between 
batches, Streaming aggregation will fail.
 Key: SPARK-19873
 URL: https://issues.apache.org/jira/browse/SPARK-19873
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Kunal Khamar









[jira] [Commented] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition

2017-03-08 Thread Brian Bruggeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901974#comment-15901974
 ] 

Brian Bruggeman commented on SPARK-19872:
-

This is a regression from spark 2.0.x.

> UnicodeDecodeError in Pyspark on sc.textFile read with repartition
> --
>
> Key: SPARK-19872
> URL: https://issues.apache.org/jira/browse/SPARK-19872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Mac and EC2
>Reporter: Brian Bruggeman
>
> I'm receiving the following traceback:
> {code}
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> I created a text file (test.txt) with standard Linux newlines:
> {code}
> a
> b
> d
> e
> f
> g
> h
> i
> j
> k
> l
> {code}
> I then ran pyspark:
> {code}
> $ pyspark
> Python 2.7.13 (default, Dec 18 2016, 07:03:39)
> [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>   /_/
> Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
> SparkSession available as 'spark'.
> >>> sc.textFile('test.txt').collect()
> [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
> >>> sc.textFile('test.txt', use_unicode=False).collect()
> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
> >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
> ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', 
> '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
> line 140, in _load_from_socket
> for item in serializer.load_stream(rf):
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 539, in load_stream
> yield self.loads(stream)
>   File 
> "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
> line 534, in loads
> return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", 
> line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: 
> invalid start byte
> {code}
> This really looks like a bug in the `serializers.py` code.






[jira] [Updated] (SPARK-19481) Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in ClosureCleaner

2017-03-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19481:
-
Fix Version/s: 2.0.3

> Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in 
> ClosureCleaner
> -
>
> Key: SPARK-19481
> URL: https://issues.apache.org/jira/browse/SPARK-19481
> Project: Spark
>  Issue Type: Test
>  Components: Spark Shell
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> org.apache.spark.repl.cancelOnInterrupt leaks a SparkContext and makes the 
> tests unstable. See:
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.repl.ReplSuite_name=should+clone+and+clean+line+object+in+ClosureCleaner






[jira] [Updated] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition

2017-03-08 Thread Brian Bruggeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Bruggeman updated SPARK-19872:

Description: 
I'm receiving the following traceback:

{code}
>>> sc.textFile('test.txt').repartition(10).collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
line 140, in _load_from_socket
for item in serializer.load_stream(rf):
  File 
"/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
line 539, in load_stream
yield self.loads(stream)
  File 
"/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
line 534, in loads
return s.decode("utf-8") if self.use_unicode else s
  File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 
16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid 
start byte
{code}

I created a text file (test.txt) with standard Linux newlines:
{code}
a
b
c
d
e
f
g
h
i
j
k
l

{code}

I then ran pyspark:
{code}
$ pyspark
Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
SparkSession available as 'spark'.
>>> sc.textFile('test.txt').collect()
[u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
>>> sc.textFile('test.txt', use_unicode=False).collect()
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
>>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', 
'\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
>>> sc.textFile('test.txt').repartition(10).collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
line 140, in _load_from_socket
for item in serializer.load_stream(rf):
  File 
"/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
line 539, in load_stream
yield self.loads(stream)
  File 
"/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
line 534, in loads
return s.decode("utf-8") if self.use_unicode else s
  File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 
16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid 
start byte
{code}

This really looks like a bug in the `serializers.py` code.
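
A possible workaround, sketched here as an assumption rather than a confirmed fix: the failure pattern above suggests the repartition shuffle re-serializes the data with the pickle serializer while collect() still applies the UTF-8 deserializer from textFile. Inserting an identity map() before repartitioning moves the RDD onto the default pickle-based serializer first, which appears to avoid the failing decode path.

{code}
# Hedged workaround sketch (not part of the original report): the identity
# map() puts the RDD on PySpark's default pickle-based serializer, so the
# shuffled bytes are no longer handed to the UTF8Deserializer that raises
# the UnicodeDecodeError.
rdd = sc.textFile('test.txt').map(lambda line: line)
print(rdd.repartition(10).collect())
{code}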

  was:
I'm receiving the following traceback:

{{
>>> sc.textFile('test.txt').repartition(10).collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
line 140, in _load_from_socket
for item in serializer.load_stream(rf):
  File 
"/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
line 539, in load_stream
yield self.loads(stream)
  File 
"/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
line 534, in loads
return s.decode("utf-8") if self.use_unicode else s
  File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 
16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid 
start byte
}}

I created a text file (test.txt) with standard Linux newlines:
{{
a
b
c
d
e
f
g
h
i
j
k
l

}}

I then ran pyspark:
{{
$ pyspark
Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on 

[jira] [Updated] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition

2017-03-08 Thread Brian Bruggeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Bruggeman updated SPARK-19872:

Description: 
I'm receiving the following traceback:

{{
>>> sc.textFile('test.txt').repartition(10).collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
line 140, in _load_from_socket
for item in serializer.load_stream(rf):
  File 
"/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
line 539, in load_stream
yield self.loads(stream)
  File 
"/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
line 534, in loads
return s.decode("utf-8") if self.use_unicode else s
  File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 
16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid 
start byte
}}

I created a text file (test.txt) with standard Linux newlines:
{{
a
b
c
d
e
f
g
h
i
j
k
l

}}

I then ran pyspark:
{{
$ pyspark
Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
SparkSession available as 'spark'.
>>> sc.textFile('test.txt').collect()
[u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
>>> sc.textFile('test.txt', use_unicode=False).collect()
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
>>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', 
'\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
>>> sc.textFile('test.txt').repartition(10).collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
line 140, in _load_from_socket
for item in serializer.load_stream(rf):
  File 
"/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
line 539, in load_stream
yield self.loads(stream)
  File 
"/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
line 534, in loads
return s.decode("utf-8") if self.use_unicode else s
  File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 
16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid 
start byte
}}

This really looks like a bug in the `serializers.py` code.

  was:
I'm receiving the following traceback:

```
>>> sc.textFile('test.txt').repartition(10).collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
line 140, in _load_from_socket
for item in serializer.load_stream(rf):
  File 
"/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
line 539, in load_stream
yield self.loads(stream)
  File 
"/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
line 534, in loads
return s.decode("utf-8") if self.use_unicode else s
  File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 
16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid 
start byte
```

I created a text file (test.txt) with standard Linux newlines:
```
a
b
c
d
e
f
g
h
i
j
k
l

```

I then ran pyspark:
```
$ pyspark
Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", 

[jira] [Created] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition

2017-03-08 Thread Brian Bruggeman (JIRA)
Brian Bruggeman created SPARK-19872:
---

 Summary: UnicodeDecodeError in Pyspark on sc.textFile read with 
repartition
 Key: SPARK-19872
 URL: https://issues.apache.org/jira/browse/SPARK-19872
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.0
 Environment: Mac and EC2
Reporter: Brian Bruggeman


I'm receiving the following traceback:

```
>>> sc.textFile('test.txt').repartition(10).collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
line 140, in _load_from_socket
for item in serializer.load_stream(rf):
  File 
"/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
line 539, in load_stream
yield self.loads(stream)
  File 
"/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
line 534, in loads
return s.decode("utf-8") if self.use_unicode else s
  File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 
16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid 
start byte
```

I created a text file (test.txt) with standard Linux newlines:
```
a
b
c
d
e
f
g
h
i
j
k
l

```

I then ran pyspark:
```
$ pyspark
Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
SparkSession available as 'spark'.
>>> sc.textFile('test.txt').collect()
[u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
>>> sc.textFile('test.txt', use_unicode=False).collect()
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
>>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', 
'\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
>>> sc.textFile('test.txt').repartition(10).collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", 
line 140, in _load_from_socket
for item in serializer.load_stream(rf):
  File 
"/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
line 539, in load_stream
yield self.loads(stream)
  File 
"/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", 
line 534, in loads
return s.decode("utf-8") if self.use_unicode else s
  File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 
16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid 
start byte
```

This really looks like a bug in the `serializers.py` code.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19727) Spark SQL round function modifies original column

2017-03-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19727:
---

Assignee: Wojciech Szymanski

> Spark SQL round function modifies original column
> -
>
> Key: SPARK-19727
> URL: https://issues.apache.org/jira/browse/SPARK-19727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sławomir Bogutyn
>Assignee: Wojciech Szymanski
>Priority: Minor
> Fix For: 2.2.0
>
>
> {code:java}
> import org.apache.spark.sql.functions
> case class MyRow(value : BigDecimal)
> val values = List(MyRow(BigDecimal.valueOf(1.23456789)))
> val dataFrame = spark.createDataFrame(values)
> dataFrame.show()
> dataFrame.withColumn("value_rounded", 
> functions.round(dataFrame.col("value"))).show()
> {code}
> This produces output:
> {noformat}
> +--------------------+
> |               value|
> +--------------------+
> |1.234567890000000000|
> +--------------------+
> +--------------------+-------------+
> |               value|value_rounded|
> +--------------------+-------------+
> |1.000000000000000000|            1|
> +--------------------+-------------+
> {noformat}
> The same problem occurs when I use the round function to filter the dataFrame.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19727) Spark SQL round function modifies original column

2017-03-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19727.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17075
[https://github.com/apache/spark/pull/17075]

> Spark SQL round function modifies original column
> -
>
> Key: SPARK-19727
> URL: https://issues.apache.org/jira/browse/SPARK-19727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sławomir Bogutyn
>Priority: Minor
> Fix For: 2.2.0
>
>
> {code:java}
> import org.apache.spark.sql.functions
> case class MyRow(value : BigDecimal)
> val values = List(MyRow(BigDecimal.valueOf(1.23456789)))
> val dataFrame = spark.createDataFrame(values)
> dataFrame.show()
> dataFrame.withColumn("value_rounded", 
> functions.round(dataFrame.col("value"))).show()
> {code}
> This produces output:
> {noformat}
> +--------------------+
> |               value|
> +--------------------+
> |1.234567890000000000|
> +--------------------+
> +--------------------+-------------+
> |               value|value_rounded|
> +--------------------+-------------+
> |1.000000000000000000|            1|
> +--------------------+-------------+
> {noformat}
> The same problem occurs when I use the round function to filter the dataFrame.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18355) Spark SQL fails to read data from a ORC hive table that has a new column added to it

2017-03-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18355:
--
Affects Version/s: 2.1.0

> Spark SQL fails to read data from a ORC hive table that has a new column 
> added to it
> 
>
> Key: SPARK-18355
> URL: https://issues.apache.org/jira/browse/SPARK-18355
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.1.0
> Environment: Centos6
>Reporter: Sandeep Nemuri
>
> *PROBLEM*:
> Spark SQL fails to read data from an ORC hive table that has a new column 
> added to it.
> Below is the exception:
> {code}
> scala> sqlContext.sql("select click_id,search_id from testorc").show
> 16/11/03 22:17:53 INFO ParseDriver: Parsing command: select 
> click_id,search_id from testorc
> 16/11/03 22:17:54 INFO ParseDriver: Parse Completed
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:38)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:38)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.copy(LogicalRelation.scala:31)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToOrcRelation(HiveMetastoreCatalog.scala:588)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:647)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:643)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> {code}
> *STEPS TO SIMULATE THIS ISSUE*:
> 1) Create table in hive.
> {code}
> CREATE TABLE `testorc`( 
> `click_id` string, 
> `search_id` string, 
> `uid` bigint)
> PARTITIONED BY ( 
> `ts` string, 
> `hour` string) 
> STORED AS ORC; 
> {code}
> 2) Load data into table :
> {code}
> INSERT INTO TABLE testorc PARTITION (ts = '98765',hour = '01' ) VALUES 
> (12,2,12345);
> {code}
> 3) Select through spark shell (This works)
> {code}
> sqlContext.sql("select click_id,search_id from testorc").show
> {code}
> 4) Now add column to hive table
> {code}
> ALTER TABLE testorc ADD COLUMNS (dummy string);
> {code}
> 5) Now again select from spark shell
> {code}
> scala> sqlContext.sql("select click_id,search_id from testorc").show
> 16/11/03 22:17:53 INFO ParseDriver: Parsing command: select 
> click_id,search_id from testorc
> 16/11/03 22:17:54 INFO ParseDriver: Parse Completed
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:38)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:38)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.copy(LogicalRelation.scala:31)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToOrcRelation(HiveMetastoreCatalog.scala:588)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:647)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:643)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>   at 
> 

[jira] [Commented] (SPARK-18355) Spark SQL fails to read data from a ORC hive table that has a new column added to it

2017-03-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901914#comment-15901914
 ] 

Dongjoon Hyun commented on SPARK-18355:
---

I confirmed that this happens only with 
`spark.sql.hive.convertMetastoreOrc=true`.
{code}
scala> spark.version
res0: String = 2.2.0-SNAPSHOT

scala> sql("select click_id, search_id from testorc").show
+--------+---------+
|click_id|search_id|
+--------+---------+
|      12|        2|
+--------+---------+

scala> sql("set spark.sql.hive.convertMetastoreOrc=true").show(false)
+----------------------------------+-----+
|key                               |value|
+----------------------------------+-----+
|spark.sql.hive.convertMetastoreOrc|true |
+----------------------------------+-----+

scala> sql("select click_id, search_id from testorc").show
17/03/08 12:15:43 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.IllegalArgumentException: Field "click_id" does not exist.
{code}
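
If that setting is indeed the trigger, a hedged mitigation (PySpark sketch, not taken from this ticket) is to leave the conversion off so the table is read through the Hive SerDe with the metastore schema rather than the ORC file footer:

{code}
# Mitigation sketch under the assumption above: disable the native ORC
# conversion so Spark reads the table with the schema stored in the
# metastore, which already contains the newly added column.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.sql("select click_id, search_id from testorc").show()
{code}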

> Spark SQL fails to read data from a ORC hive table that has a new column 
> added to it
> 
>
> Key: SPARK-18355
> URL: https://issues.apache.org/jira/browse/SPARK-18355
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
> Environment: Centos6
>Reporter: Sandeep Nemuri
>
> *PROBLEM*:
> Spark SQL fails to read data from an ORC hive table that has a new column 
> added to it.
> Below is the exception:
> {code}
> scala> sqlContext.sql("select click_id,search_id from testorc").show
> 16/11/03 22:17:53 INFO ParseDriver: Parsing command: select 
> click_id,search_id from testorc
> 16/11/03 22:17:54 INFO ParseDriver: Parse Completed
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:38)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:38)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.copy(LogicalRelation.scala:31)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToOrcRelation(HiveMetastoreCatalog.scala:588)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:647)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:643)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> {code}
> *STEPS TO SIMULATE THIS ISSUE*:
> 1) Create table in hive.
> {code}
> CREATE TABLE `testorc`( 
> `click_id` string, 
> `search_id` string, 
> `uid` bigint)
> PARTITIONED BY ( 
> `ts` string, 
> `hour` string) 
> STORED AS ORC; 
> {code}
> 2) Load data into table :
> {code}
> INSERT INTO TABLE testorc PARTITION (ts = '98765',hour = '01' ) VALUES 
> (12,2,12345);
> {code}
> 3) Select through spark shell (This works)
> {code}
> sqlContext.sql("select click_id,search_id from testorc").show
> {code}
> 4) Now add column to hive table
> {code}
> ALTER TABLE testorc ADD COLUMNS (dummy string);
> {code}
> 5) Now again select from spark shell
> {code}
> scala> sqlContext.sql("select click_id,search_id from testorc").show
> 16/11/03 22:17:53 INFO ParseDriver: Parsing command: select 
> click_id,search_id from testorc
> 16/11/03 22:17:54 INFO ParseDriver: Parse Completed
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:38)
>   at scala.Option.map(Option.scala:145)
>   at 
> 

[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe

2017-03-08 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901769#comment-15901769
 ] 

Owen O'Malley commented on SPARK-15474:
---

Ok, Hive's use is fine because it gets the schema from the metastore, and the 
file schema only matters for schema evolution, which isn't relevant if there 
are no rows.

In fact, it gets worse in newer versions of Hive, where OrcOutputFormat will 
write 0-byte files and OrcInputFormat will ignore 0-byte files when reading. 
(The reason for needing the files at all is an interesting bit of Hive 
history, but not relevant here.)

The real fix is that Spark needs to use the OrcFile.createWriter(...) API to 
write the files rather than Hive's OrcOutputFormat. The OrcFile API lets the 
caller set the schema directly.
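
Until a writer-side change like that lands, a hedged reader-side sketch (PySpark, with an assumed path and schema, not taken from this ticket) is to supply the schema explicitly so the failing inference step is skipped entirely:

{code}
# Reader-side sketch: with a user-specified schema the ORC source has nothing
# to infer, so an empty output directory simply yields an empty DataFrame.
from pyspark.sql.types import LongType, StructField, StructType

schema = StructType([StructField("id", LongType(), True)])  # assumed to match the writer
copy_empty_df = spark.read.schema(schema).format("orc").load("/tmp/empty_orc")  # hypothetical path
copy_empty_df.show()
{code}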

>  ORC data source fails to write and read back empty dataframe
> -
>
> Key: SPARK-15474
> URL: https://issues.apache.org/jira/browse/SPARK-15474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently ORC data source fails to write and read empty data.
> The code below:
> {code}
> val emptyDf = spark.range(10).limit(0)
> emptyDf.write
>   .format("orc")
>   .save(path.getCanonicalPath)
> val copyEmptyDf = spark.read
>   .format("orc")
>   .load(path.getCanonicalPath)
> copyEmptyDf.show()
> {code}
> throws an exception below:
> {code}
> Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
> {code}
> Note that this is a different case with the data below
> {code}
> val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> {code}
> In this case, any writer is not initialised and created. (no calls of 
> {{WriterContainer.writeRows()}}.
> For Parquet and JSON, it works but ORC does not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19864) add makeQualifiedPath in SQLTestUtils to optimize some code

2017-03-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19864.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17204
[https://github.com/apache/spark/pull/17204]

> add makeQualifiedPath in SQLTestUtils to optimize some code
> ---
>
> Key: SPARK-19864
> URL: https://issues.apache.org/jira/browse/SPARK-19864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently there are lots of places that make a path qualified; it would be 
> better to provide a function to do this so the code becomes simpler.
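
For illustration only, a hedged Python analogue of what such a helper does (this is not the actual Scala SQLTestUtils method): turn a possibly relative local path into one fully qualified URI form, roughly what Hadoop's Path.makeQualified produces for the local filesystem.

{code}
from pathlib import Path

def make_qualified_path(p):
    # Hypothetical analogue: resolve to an absolute path and attach a
    # file: scheme so every test compares against the same qualified form.
    return Path(p).resolve().as_uri()

print(make_qualified_path("spark-warehouse"))  # e.g. file:///.../spark-warehouse
{code}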



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19864) add makeQualifiedPath in SQLTestUtils to optimize some code

2017-03-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19864:
---

Assignee: Song Jun

> add makeQualifiedPath in SQLTestUtils to optimize some code
> ---
>
> Key: SPARK-19864
> URL: https://issues.apache.org/jira/browse/SPARK-19864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Assignee: Song Jun
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently there are lots of places that make a path qualified; it would be 
> better to provide a function to do this so the code becomes simpler.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18209) More robust view canonicalization without full SQL expansion

2017-03-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18209.
-
Resolution: Fixed

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error-prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons for view canonicalization are to provide 
> the database context and to perform star expansion, I think we can do this 
> through a simpler approach: take the user-given SQL, analyze it, wrap the 
> original SQL in an outer SELECT clause, and store the database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> During parsing, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint, as I'm merely illustrating 
> the high level approach here.)
> Note that there is a chance that the underlying base table(s)' schemas change 
> and the stored schema of the view might differ from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. 
> This exception can be controlled by a flag.
> Update 1: based on the discussion below, we don't even need to put the view 
> definition in a sub query. We can just add it via a logical plan at the end.
> Update 2: we should make sure permanent views do not depend on temporary 
> objects (views, tables, or functions).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19871) Improve error message in verify_type to indicate which field the error is for

2017-03-08 Thread Len Frodgers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901668#comment-15901668
 ] 

Len Frodgers commented on SPARK-19871:
--

https://github.com/apache/spark/pull/17213

> Improve error message in verify_type to indicate which field the error is for
> -
>
> Key: SPARK-19871
> URL: https://issues.apache.org/jira/browse/SPARK-19871
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Len Frodgers
>
> The error message for _verify_type is too vague. It should specify for which 
> field the type verification failed.
> e.g. error: "This field is not nullable, but got None" – but what is the 
> field?!
> In a dataframe with many non-nullable fields, it is a nightmare trying to 
> hunt down which field is null, since one has to resort to basic (and very 
> slow) trial and error.
> I have happily created a PR to fix this.
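
A minimal reproduction sketch (hypothetical schema and data, not taken from this ticket) shows why the field name matters: with several non-nullable fields, the current message gives no hint which one received the None.

{code}
# Sketch: createDataFrame verifies each row against the schema, and a None in
# a non-nullable field raises "This field is not nullable, but got None"
# without naming the offending field ("b" here).
from pyspark.sql.types import StringType, StructField, StructType

schema = StructType([
    StructField("a", StringType(), nullable=False),
    StructField("b", StringType(), nullable=False),
])
spark.createDataFrame([("x", None)], schema)
{code}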



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13740) add null check for _verify_type in types.py

2017-03-08 Thread Len Frodgers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901667#comment-15901667
 ] 

Len Frodgers commented on SPARK-13740:
--

https://github.com/apache/spark/pull/17213

> add null check for _verify_type in types.py
> ---
>
> Key: SPARK-13740
> URL: https://issues.apache.org/jira/browse/SPARK-13740
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19871) Improve error message in verify_type to indicate which field the error is for

2017-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19871:


Assignee: Apache Spark

> Improve error message in verify_type to indicate which field the error is for
> -
>
> Key: SPARK-19871
> URL: https://issues.apache.org/jira/browse/SPARK-19871
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Len Frodgers
>Assignee: Apache Spark
>
> The error message for _verify_type is too vague. It should specify for which 
> field the type verification failed.
> e.g. error: "This field is not nullable, but got None" – but what is the 
> field?!
> In a dataframe with many non-nullable fields, it is a nightmare trying to 
> hunt down which field is null, since one has to resort to basic (and very 
> slow) trial and error.
> I have happily created a PR to fix this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


