[jira] [Updated] (SPARK-19881) Support Dynamic Partition Inserts params with SET command
[ https://issues.apache.org/jira/browse/SPARK-19881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-19881:
--
Description:
Since Spark 2.0.0, `SET` commands do not pass the values to HiveClient. In most cases, Spark handles this well. However, for dynamic partition inserts, users run into the following misleading situation.

{code}
scala> spark.range(1001).selectExpr("id as key", "id as value").registerTempTable("t1001")
scala> sql("create table p (value int) partitioned by (key int)").show
scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.spark.SparkException: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict

scala> sql("set hive.exec.dynamic.partition.mode=nonstrict")
scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions created is 1001, which is more than 1000. To solve this try to set hive.exec.max.dynamic.partitions to at least 1001.

scala> sql("set hive.exec.max.dynamic.partitions=1001")
scala> sql("set hive.exec.max.dynamic.partitions").show(false)
+--------------------------------+-----+
|key                             |value|
+--------------------------------+-----+
|hive.exec.max.dynamic.partitions|1001 |
+--------------------------------+-----+

scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions created is 1001, which is more than 1000. To solve this try to set hive.exec.max.dynamic.partitions to at least 1001.
{code}

The last error is the same as the previous one: `HiveClient` does not know about the new value 1001. There is no way to change the default value of `hive.exec.max.dynamic.partitions` of `HiveClient` with the `SET` command, because the `hive` parameters are passed to `HiveClient` only when it is created. So the workaround is to use `--hiveconf` when starting `spark-shell`; the value remains unchangeable from within `spark-shell` afterwards. We had better handle this case without misleading error messages that send users into an infinite loop.

> Support Dynamic Partition Inserts params with SET command
> -
>
> Key: SPARK-19881
> URL: https://issues.apache.org/jira/browse/SPARK-19881
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0, 2.1.0
> Reporter: Dongjoon Hyun
> Priority: Minor
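Since the Hive parameters only reach `HiveClient` at creation time, the workaround described above has to be applied on the command line before the session starts. A minimal sketch of such an invocation (the limit 2000 is an illustrative value, not one prescribed by the issue):

```shell
# Pass dynamic-partition settings at startup so they reach HiveClient;
# a SET command inside the session will not propagate them (SPARK-19881).
spark-shell \
  --hiveconf hive.exec.dynamic.partition.mode=nonstrict \
  --hiveconf hive.exec.max.dynamic.partitions=2000
```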
[jira] [Assigned] (SPARK-19881) Support Dynamic Partition Inserts params with SET command
[ https://issues.apache.org/jira/browse/SPARK-19881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19881:
Assignee: (was: Apache Spark)

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19881) Support Dynamic Partition Inserts params with SET command
[ https://issues.apache.org/jira/browse/SPARK-19881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902614#comment-15902614 ] Apache Spark commented on SPARK-19881:
--
User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/17223
[jira] [Assigned] (SPARK-19881) Support Dynamic Partition Inserts params with SET command
[ https://issues.apache.org/jira/browse/SPARK-19881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19881:
Assignee: Apache Spark
[jira] [Created] (SPARK-19881) Support Dynamic Partition Inserts params with SET command
Dongjoon Hyun created SPARK-19881:
-
Summary: Support Dynamic Partition Inserts params with SET command
Key: SPARK-19881
URL: https://issues.apache.org/jira/browse/SPARK-19881
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.1.0, 2.0.0
Reporter: Dongjoon Hyun
Priority: Minor

Currently, the `SET` command does not pass the values to Hive. In most cases, Spark handles this well. However, for dynamic partition inserts, users run into the following situation.

{code}
scala> spark.range(1001).selectExpr("id as key", "id as value").registerTempTable("t1001")
scala> sql("create table p (value int) partitioned by (key int)").show
scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.spark.SparkException: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict

scala> sql("set hive.exec.dynamic.partition.mode=nonstrict")
scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions created is 1001, which is more than 1000. To solve this try to set hive.exec.max.dynamic.partitions to at least 1001.

scala> sql("set hive.exec.max.dynamic.partitions=1001")
scala> sql("insert into table p partition(key) select key, value from t1001")
org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions created is 1001, which is more than 1000. To solve this try to set hive.exec.max.dynamic.partitions to at least 1001.
<== Repeats the same error message.
{code}

The root cause is that the `hive` parameters are passed to `HiveClient` only when it is created, so the workaround is to use `--hiveconf`. We had better handle this case without misleading error messages.
[jira] [Commented] (SPARK-19439) PySpark's registerJavaFunction Should Support UDAFs
[ https://issues.apache.org/jira/browse/SPARK-19439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902601#comment-15902601 ] Apache Spark commented on SPARK-19439:
--
User 'zjffdu' has created a pull request for this issue: https://github.com/apache/spark/pull/17222

> PySpark's registerJavaFunction Should Support UDAFs
> ---
>
> Key: SPARK-19439
> URL: https://issues.apache.org/jira/browse/SPARK-19439
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 2.1.0
> Reporter: Keith Bourgoin
>
> When trying to import a Scala UDAF using registerJavaFunction, I get this error:
> {code}
> In [1]: sqlContext.registerJavaFunction('geo_mean', 'com.foo.bar.GeometricMean')
> ---
> Py4JJavaError                             Traceback (most recent call last)
> in ()
> > 1 sqlContext.registerJavaFunction('geo_mean', 'com.foo.bar.GeometricMean')
> /home/kfb/src/projects/spark/python/pyspark/sql/context.pyc in registerJavaFunction(self, name, javaClassName, returnType)
>     227         if returnType is not None:
>     228             jdt = self.sparkSession._jsparkSession.parseDataType(returnType.json())
> --> 229         self.sparkSession._jsparkSession.udf().registerJava(name, javaClassName, jdt)
>     230
>     231     # TODO(andrew): delete this once we refactor things to take in SparkSession
> /home/kfb/src/projects/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
>    1131         answer = self.gateway_client.send_command(command)
>    1132         return_value = get_return_value(
> -> 1133             answer, self.gateway_client, self.target_id, self.name)
>    1134
>    1135         for temp_arg in temp_args:
> /home/kfb/src/projects/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
>     61     def deco(*a, **kw):
>     62         try:
> ---> 63             return f(*a, **kw)
>     64         except py4j.protocol.Py4JJavaError as e:
>     65             s = e.java_exception.toString()
> /home/kfb/src/projects/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
>     317                 raise Py4JJavaError(
>     318                     "An error occurred while calling {0}{1}{2}.\n".
> --> 319                     format(target_id, ".", name), value)
>     320             else:
>     321                 raise Py4JError(
> Py4JJavaError: An error occurred while calling o28.registerJava.
> : java.io.IOException: UDF class com.foo.bar.GeometricMean doesn't implement any UDF interface
> 	at org.apache.spark.sql.UDFRegistration.registerJava(UDFRegistration.scala:438)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> 	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> 	at py4j.Gateway.invoke(Gateway.java:280)
> 	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> 	at py4j.commands.CallCommand.execute(CallCommand.java:79)
> 	at py4j.GatewayConnection.run(GatewayConnection.java:214)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> According to SPARK-10915, UDAFs in Python aren't happening anytime soon. Without this, there's no way to get Scala UDAFs into Python Spark SQL whatsoever. Fixing that would be a huge help so that we can keep aggregations in the JVM while using DataFrames. Otherwise, all our code has to drop down to RDDs and live in Python.
[jira] [Assigned] (SPARK-19439) PySpark's registerJavaFunction Should Support UDAFs
[ https://issues.apache.org/jira/browse/SPARK-19439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19439:
Assignee: Apache Spark
[jira] [Assigned] (SPARK-19439) PySpark's registerJavaFunction Should Support UDAFs
[ https://issues.apache.org/jira/browse/SPARK-19439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19439:
Assignee: (was: Apache Spark)
[jira] [Resolved] (SPARK-19874) Hide API docs for "org.apache.spark.sql.internal"
[ https://issues.apache.org/jira/browse/SPARK-19874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19874.
--
Resolution: Fixed
Fix Version/s: 2.2.0
               2.1.1

> Hide API docs for "org.apache.spark.sql.internal"
> -
>
> Key: SPARK-19874
> URL: https://issues.apache.org/jira/browse/SPARK-19874
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 2.1.0
> Reporter: Shixiong Zhu
> Assignee: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
> The API docs should not include the "org.apache.spark.sql.internal" package because they are internal private APIs.
[jira] [Updated] (SPARK-19874) Hide API docs for "org.apache.spark.sql.internal"
[ https://issues.apache.org/jira/browse/SPARK-19874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19874:
Priority: Minor (was: Major)
[jira] [Resolved] (SPARK-19235) Enable Test Cases in DDLSuite with Hive Metastore
[ https://issues.apache.org/jira/browse/SPARK-19235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19235.
--
Resolution: Fixed
Fix Version/s: 2.2.0
Issue resolved by pull request 16592 [https://github.com/apache/spark/pull/16592]

> Enable Test Cases in DDLSuite with Hive Metastore
> -
>
> Key: SPARK-19235
> URL: https://issues.apache.org/jira/browse/SPARK-19235
> Project: Spark
> Issue Type: Test
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Xiao Li
> Assignee: Xiao Li
> Fix For: 2.2.0
>
> So far, the test cases in DDLSuite only verify the behaviors of InMemoryCatalog. That means they do not cover the scenarios using HiveExternalCatalog. Thus, we need to improve the existing test suite to run these cases using the Hive metastore.
[jira] [Comment Edited] (SPARK-11141) Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes
[ https://issues.apache.org/jira/browse/SPARK-11141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902537#comment-15902537 ] Jim Kleckner edited comment on SPARK-11141 at 3/9/17 5:54 AM: -- FYI, this can cause problems when not using S3 during shutdown as described in this AWS posting: https://forums.aws.amazon.com/thread.jspa?threadID=223378 The workaround indicated is to use --conf spark.streaming.driver.writeAheadLog.allowBatching=false with the submit. The exception contains the text: {code} streaming stop ReceivedBlockTracker: Exception thrown while writing record: BatchAllocationEvent {code} was (Author: jkleckner): FYI, this can cause problems when not using S3 during shutdown as described in this AWS posting: https://forums.aws.amazon.com/thread.jspa?threadID=223378 The workaround indicated is to use --conf spark.streaming.driver.writeAheadLog.allowBatching=false with the submit. > Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes > -- > > Key: SPARK-11141 > URL: https://issues.apache.org/jira/browse/SPARK-11141 > Project: Spark > Issue Type: Improvement > Components: DStreams >Reporter: Burak Yavuz >Assignee: Burak Yavuz > Fix For: 1.6.0 > > > When using S3 as a directory for WALs, the writes take too long. The driver > gets very easily bottlenecked when multiple receivers send AddBlock events to > the ReceiverTracker. This PR adds batching of events in the > ReceivedBlockTracker so that receivers don't get blocked by the driver for > too long. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11141) Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes
[ https://issues.apache.org/jira/browse/SPARK-11141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902537#comment-15902537 ] Jim Kleckner commented on SPARK-11141: -- FYI, this can cause problems when not using S3 during shutdown as described in this AWS posting: https://forums.aws.amazon.com/thread.jspa?threadID=223378 The workaround indicated is to use --conf spark.streaming.driver.writeAheadLog.allowBatching=false with the submit. > Batching of ReceivedBlockTrackerLogEvents for efficient WAL writes > -- > > Key: SPARK-11141 > URL: https://issues.apache.org/jira/browse/SPARK-11141 > Project: Spark > Issue Type: Improvement > Components: DStreams >Reporter: Burak Yavuz >Assignee: Burak Yavuz > Fix For: 1.6.0 > > > When using S3 as a directory for WALs, the writes take too long. The driver > gets very easily bottlenecked when multiple receivers send AddBlock events to > the ReceiverTracker. This PR adds batching of events in the > ReceivedBlockTracker so that receivers don't get blocked by the driver for > too long. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
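The workaround above amounts to one extra `--conf` flag on submission. As a rough sketch (plain Python assembling the command line; "my_streaming_app.jar" is a placeholder, not a jar from the issue):

```python
# Sketch: assemble the spark-submit invocation with WAL batching disabled,
# per the workaround cited from the AWS forum thread above.
# "my_streaming_app.jar" is a placeholder application jar.
submit_cmd = [
    "spark-submit",
    "--conf", "spark.streaming.driver.writeAheadLog.allowBatching=false",
    "my_streaming_app.jar",
]

# With batching off, the ReceivedBlockTracker writes WAL records one at a
# time, avoiding the BatchAllocationEvent write failure on shutdown.
print(" ".join(submit_cmd))
```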
[jira] [Commented] (SPARK-19866) Add local version of Word2Vec findSynonyms for spark.ml: Python API
[ https://issues.apache.org/jira/browse/SPARK-19866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902522#comment-15902522 ] Xin Ren commented on SPARK-19866: - I can try this one :) > Add local version of Word2Vec findSynonyms for spark.ml: Python API > --- > > Key: SPARK-19866 > URL: https://issues.apache.org/jira/browse/SPARK-19866 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Add Python API for findSynonymsArray matching Scala API in linked JIRA. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19880) About spark2.0.2 and spark1.4.1 beeline to show the database, use the default operation such as dealing with different
[ https://issues.apache.org/jira/browse/SPARK-19880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolong updated SPARK-19880: Description: Beeline commands such as "show databases" and "use default" behave differently in Spark 2.0.2 and Spark 1.4.1. In Spark 2.0.2 these commands are executed as Spark jobs, so when the cluster is busy with long-running jobs (for example, a large query), they must wait in the queue behind those jobs. But in Spark 1.4.1 the same commands did not launch a job, so they returned without waiting. was: Beeline commands such as "show databases" and "use default" behave differently in Spark 2.0.2 and Spark 1.4.1. In Spark 2.0.2 these commands are executed as Spark jobs, so when the cluster is busy with long-running jobs (for example, a large query), they must wait in the queue behind those jobs. > About spark2.0.2 and spark1.4.1 beeline to show the database, use the default > operation such as dealing with different > -- > > Key: SPARK-19880 > URL: https://issues.apache.org/jira/browse/SPARK-19880 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: guoxiaolong > > Beeline commands such as "show databases" and "use default" behave differently > in Spark 2.0.2 and Spark 1.4.1: in 2.0.2 they run as Spark jobs and must wait > in the queue behind long-running jobs, while in 1.4.1 they returned without > waiting. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19880) About spark2.0.2 and spark1.4.1 beeline to show the database, use the default operation such as dealing with different
guoxiaolong created SPARK-19880: --- Summary: About spark2.0.2 and spark1.4.1 beeline to show the database, use the default operation such as dealing with different Key: SPARK-19880 URL: https://issues.apache.org/jira/browse/SPARK-19880 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: guoxiaolong Beeline commands such as "show databases" and "use default" behave differently in Spark 2.0.2 and Spark 1.4.1. In Spark 2.0.2 these commands are executed as Spark jobs, so when the cluster is busy with long-running jobs (for example, a large query), they must wait in the queue behind those jobs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19862) In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted.
[ https://issues.apache.org/jira/browse/SPARK-19862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19862: Assignee: (was: Apache Spark) > In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted. > - > > Key: SPARK-19862 > URL: https://issues.apache.org/jira/browse/SPARK-19862 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: guoxiaolong >Priority: Trivial > > "tungsten-sort" -> > classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName can be > deleted. Because it is the same of "sort" -> > classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19862) In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted.
[ https://issues.apache.org/jira/browse/SPARK-19862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19862: Assignee: Apache Spark > In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted. > - > > Key: SPARK-19862 > URL: https://issues.apache.org/jira/browse/SPARK-19862 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: guoxiaolong >Assignee: Apache Spark >Priority: Trivial > > "tungsten-sort" -> > classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName can be > deleted. Because it is the same of "sort" -> > classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19862) In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted.
[ https://issues.apache.org/jira/browse/SPARK-19862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902407#comment-15902407 ] Apache Spark commented on SPARK-19862: -- User 'guoxiaolongzte' has created a pull request for this issue: https://github.com/apache/spark/pull/17220 > In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted. > - > > Key: SPARK-19862 > URL: https://issues.apache.org/jira/browse/SPARK-19862 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: guoxiaolong >Priority: Trivial > > "tungsten-sort" -> > classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName can be > deleted. Because it is the same of "sort" -> > classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
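A plain-Python model of the `shortShuffleMgrNames` map (Spark's actual code is Scala, in SparkEnv.scala) shows why the "tungsten-sort" entry is redundant:

```python
# Model of SparkEnv's shortShuffleMgrNames: both short names resolve to the
# same implementation class, so the "tungsten-sort" alias adds nothing.
SORT_SHUFFLE_MANAGER = "org.apache.spark.shuffle.sort.SortShuffleManager"

short_shuffle_mgr_names = {
    "sort": SORT_SHUFFLE_MANAGER,
    "tungsten-sort": SORT_SHUFFLE_MANAGER,  # identical to the "sort" entry
}

# Only one distinct implementation is reachable through this map, which is
# the rationale for deleting the duplicate alias.
print(set(short_shuffle_mgr_names.values()))
```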
[jira] [Updated] (SPARK-19878) Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
[ https://issues.apache.org/jira/browse/SPARK-19878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kavn qin updated SPARK-19878: - External issue URL: (was: https://issues.apache.org/jira/browse/SPARK-17920) External issue ID: (was: SPARK-17920) Issue Type: Improvement (was: Wish) > Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala > -- > > Key: SPARK-19878 > URL: https://issues.apache.org/jira/browse/SPARK-19878 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0 > Environment: Centos 6.5: Hadoop 2.6.0, Spark 1.5.0, Hive 1.1.0 >Reporter: kavn qin >Priority: Minor > Labels: patch > Attachments: SPARK-19878.patch > > > When case class InsertIntoHiveTable intializes a serde it explicitly passes > null for the Configuration in Spark 1.5.0: > [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58] > While in Spark 2.0.0, the HiveWriterContainer intializes a serde it also just > passes null for the Configuration: > [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161] > When we implement a hive serde, we want to use the hive configuration to get > some static and dynamic settings, but we can not do it ! > So this patch add the configuration when initialize hive serde. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19878) Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
[ https://issues.apache.org/jira/browse/SPARK-19878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kavn qin updated SPARK-19878: - Attachment: SPARK-19878.patch > Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala > -- > > Key: SPARK-19878 > URL: https://issues.apache.org/jira/browse/SPARK-19878 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0 > Environment: Centos 6.5: Hadoop 2.6.0, Spark 1.5.0, Hive 1.1.0 >Reporter: kavn qin >Priority: Minor > Labels: patch > Attachments: SPARK-19878.patch > > > When case class InsertIntoHiveTable intializes a serde it explicitly passes > null for the Configuration in Spark 1.5.0: > [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58] > While in Spark 2.0.0, the HiveWriterContainer intializes a serde it also just > passes null for the Configuration: > [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161] > When we implement a hive serde, we want to use the hive configuration to get > some static and dynamic settings, but we can not do it ! > So this patch add the configuration when initialize hive serde. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19878) Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
[ https://issues.apache.org/jira/browse/SPARK-19878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kavn qin updated SPARK-19878: - Attachment: (was: SPARK-19878.patch) > Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala > -- > > Key: SPARK-19878 > URL: https://issues.apache.org/jira/browse/SPARK-19878 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0 > Environment: Centos 6.5: Hadoop 2.6.0, Spark 1.5.0, Hive 1.1.0 >Reporter: kavn qin >Priority: Minor > Labels: patch > > When case class InsertIntoHiveTable intializes a serde it explicitly passes > null for the Configuration in Spark 1.5.0: > [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58] > While in Spark 2.0.0, the HiveWriterContainer intializes a serde it also just > passes null for the Configuration: > [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161] > When we implement a hive serde, we want to use the hive configuration to get > some static and dynamic settings, but we can not do it ! > So this patch add the configuration when initialize hive serde. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19878) Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
[ https://issues.apache.org/jira/browse/SPARK-19878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kavn qin updated SPARK-19878: - Description: When the case class InsertIntoHiveTable initializes a serde, it explicitly passes null for the Configuration in Spark 1.5.0: [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58] Likewise, in Spark 2.0.0, when the HiveWriterContainer initializes a serde it also just passes null for the Configuration: [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161] When we implement a hive serde, we want to use the hive configuration to read static and dynamic settings, but we cannot, because the configuration is null. So this patch passes the configuration when the hive serde is initialized. was: When the case class InsertIntoHiveTable initializes a serde, it explicitly passes null for the Configuration in Spark 1.5.0: [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58] Likewise, in Spark 2.0.0, when the HiveWriterContainer initializes a serde it also just passes null for the Configuration: [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161] When we implement a hive serde, we want to use the hive configuration to read static and dynamic settings, but we cannot! This patch just passes the configuration when the hive serde is initialized.
> Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala > -- > > Key: SPARK-19878 > URL: https://issues.apache.org/jira/browse/SPARK-19878 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0 > Environment: Centos 6.5: Hadoop 2.6.0, Spark 1.5.0, Hive 1.1.0 >Reporter: kavn qin >Priority: Minor > Labels: patch > Attachments: SPARK-19878.patch > > > When the case class InsertIntoHiveTable initializes a serde, it explicitly > passes null for the Configuration in Spark 1.5.0: > [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58] > Likewise, in Spark 2.0.0, when the HiveWriterContainer initializes a serde it > also just passes null for the Configuration: > [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161] > When we implement a hive serde, we want to use the hive configuration to read > some static and dynamic settings, but we cannot, because the configuration is > null. So this patch passes the configuration when the hive serde is > initialized. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
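The failure mode the patch targets can be sketched in plain Python (all names here are invented for illustration; the real interface is Hive's `SerDe.initialize(Configuration, Properties)` on the JVM):

```python
# Hypothetical serde whose initialize() reads settings from the job
# configuration -- the case that breaks when Spark passes null.
class CustomSerDe:
    def initialize(self, conf, table_properties):
        if conf is None:
            # This is what a conf-dependent custom serde hits today:
            # Spark hands it null, so its settings are unreachable.
            raise ValueError("Configuration required but got null")
        self.encoding = conf.get("custom.serde.encoding", "UTF-8")

serde = CustomSerDe()
# With a real configuration passed through, the serde can pick up its setting.
serde.initialize({"custom.serde.encoding": "GBK"}, table_properties={})
print(serde.encoding)  # → GBK
```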
[jira] [Issue Comment Deleted] (SPARK-12180) DataFrame.join() in PySpark gives misleading exception when column name exists on both side
[ https://issues.apache.org/jira/browse/SPARK-12180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Kumar updated SPARK-12180: --- Comment: was deleted (was: Is there any concrete solution or reason explaining the issue ? I am facing this in Spark 1.6.2 when I try to join two data-frames derived from same source data-frame.) > DataFrame.join() in PySpark gives misleading exception when column name > exists on both side > --- > > Key: SPARK-12180 > URL: https://issues.apache.org/jira/browse/SPARK-12180 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Daniel Thomas > > When joining two DataFrames on a column 'session_uuid' I got the following > exception, because both DataFrames hat a column called 'at'. The exception is > misleading in the cause and in the column causing the problem. Renaming the > column fixed the exception. > --- > Py4JJavaError Traceback (most recent call last) > /Applications/spark-1.5.2-bin-hadoop2.4/python/pyspark/sql/utils.py in > deco(*a, **kw) > 35 try: > ---> 36 return f(*a, **kw) > 37 except py4j.protocol.Py4JJavaError as e: > /Applications/spark-1.5.2-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py > in get_return_value(answer, gateway_client, target_id, name) > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > Py4JJavaError: An error occurred while calling o484.join. 
> : org.apache.spark.sql.AnalysisException: resolved attribute(s) > session_uuid#3278 missing from > uuid_x#9078,total_session_sec#9115L,at#3248,session_uuid#9114,uuid#9117,at#9084 > in operator !Join Inner, Some((uuid_x#9078 = session_uuid#3278)); > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > During 
handling of the above exception, another exception occurred: > AnalysisException Traceback (most recent call last) > in () > 1 sel_starts = starts.select('uuid', 'at').withColumnRenamed('uuid', > 'uuid_x')#.withColumnRenamed('at', 'at_x') > 2 sel_closes = closes.select('uuid', 'at', 'session_uuid', > 'total_session_sec') > > 3 start_close = sel_starts.join(sel_closes, sel_starts['uuid_x'] == > sel_closes['session_uuid']) > 4 start_close.cache() > 5 start_close.take(1) > /Applications/spark-1.5.2-bin-hadoop2.4/python/pyspark/sql/dataframe.py in > join(self, other, on, how) > 579 on = on[0] > 580 if how is None: > --> 581 jdf = self._jdf.join(other._jdf, on._jc, "inner") > 582 else: > 583 assert isinstance(how, basestring), "how should be > basestring" >
[jira] [Updated] (SPARK-19878) Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
[ https://issues.apache.org/jira/browse/SPARK-19878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kavn qin updated SPARK-19878: - Attachment: SPARK-19878.patch > Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala > -- > > Key: SPARK-19878 > URL: https://issues.apache.org/jira/browse/SPARK-19878 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 1.5.0, 1.6.0, 2.0.0 > Environment: Centos 6.5: Hadoop 2.6.0, Spark 1.5.0, Hive 1.1.0 >Reporter: kavn qin >Priority: Minor > Labels: patch > Attachments: SPARK-19878.patch > > > When case class InsertIntoHiveTable intializes a serde it explicitly passes > null for the Configuration in Spark 1.5.0: > [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58] > While in Spark 2.0.0, the HiveWriterContainer intializes a serde it also just > passes null for the Configuration: > [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161] > When we implement a hive serde, we want to use the hive configuration to get > some static and dynamic settings, we can not do it! > This patch just add the configuration when initialize hive serde. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19879) Spark UI table sort breaks event timeline
Sebastian Estevez created SPARK-19879: - Summary: Spark UI table sort breaks event timeline Key: SPARK-19879 URL: https://issues.apache.org/jira/browse/SPARK-19879 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.6.3 Reporter: Sebastian Estevez Priority: Minor When you sort the task list at the bottom of the page the new sorting order seems to get carried into the event timeline visualization placing tasks in the wrong hosts and across the wrong time periods. This is especially evident if a task has a different number of tasks running per executor or if there are some outlier long running tasks in one or a few of the executors. http://:4040/stages/stage/?id=0=0 http://:4040/stages/stage/?id=0=0=Duration=true=100 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19878) Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
kavn qin created SPARK-19878: Summary: Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala Key: SPARK-19878 URL: https://issues.apache.org/jira/browse/SPARK-19878 Project: Spark Issue Type: Wish Components: SQL Affects Versions: 2.0.0, 1.6.0, 1.5.0 Environment: Centos 6.5: Hadoop 2.6.0, Spark 1.5.0, Hive 1.1.0 Reporter: kavn qin Priority: Minor When the case class InsertIntoHiveTable initializes a serde, it explicitly passes null for the Configuration in Spark 1.5.0: [https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L58] Likewise, in Spark 2.0.0, when the HiveWriterContainer initializes a serde it also just passes null for the Configuration: [https://github.com/apache/spark/blob/v2.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L161] When we implement a hive serde, we want to use the hive configuration to read static and dynamic settings, but we cannot! This patch just passes the configuration when the hive serde is initialized. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12180) DataFrame.join() in PySpark gives misleading exception when column name exists on both side
[ https://issues.apache.org/jira/browse/SPARK-12180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902389#comment-15902389 ] Abhishek Kumar edited comment on SPARK-12180 at 3/9/17 2:52 AM: Is there any concrete solution or reason explaining the issue ? I am facing this in Spark 1.6.2 when I try to join two data-frames derived from same source data-frame. was (Author: abhikumar): Is there any concrete solution or reason explaining the issue ? I have facing this in Spark 1.6.2 when I try to join two data-frames derived from same source data-frame. > DataFrame.join() in PySpark gives misleading exception when column name > exists on both side > --- > > Key: SPARK-12180 > URL: https://issues.apache.org/jira/browse/SPARK-12180 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Daniel Thomas > > When joining two DataFrames on a column 'session_uuid' I got the following > exception, because both DataFrames hat a column called 'at'. The exception is > misleading in the cause and in the column causing the problem. Renaming the > column fixed the exception. > --- > Py4JJavaError Traceback (most recent call last) > /Applications/spark-1.5.2-bin-hadoop2.4/python/pyspark/sql/utils.py in > deco(*a, **kw) > 35 try: > ---> 36 return f(*a, **kw) > 37 except py4j.protocol.Py4JJavaError as e: > /Applications/spark-1.5.2-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py > in get_return_value(answer, gateway_client, target_id, name) > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > Py4JJavaError: An error occurred while calling o484.join. 
> : org.apache.spark.sql.AnalysisException: resolved attribute(s) > session_uuid#3278 missing from > uuid_x#9078,total_session_sec#9115L,at#3248,session_uuid#9114,uuid#9117,at#9084 > in operator !Join Inner, Some((uuid_x#9078 = session_uuid#3278)); > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > During 
handling of the above exception, another exception occurred: > AnalysisException Traceback (most recent call last) > in () > 1 sel_starts = starts.select('uuid', 'at').withColumnRenamed('uuid', > 'uuid_x')#.withColumnRenamed('at', 'at_x') > 2 sel_closes = closes.select('uuid', 'at', 'session_uuid', > 'total_session_sec') > > 3 start_close = sel_starts.join(sel_closes, sel_starts['uuid_x'] == > sel_closes['session_uuid']) > 4 start_close.cache() > 5 start_close.take(1) > /Applications/spark-1.5.2-bin-hadoop2.4/python/pyspark/sql/dataframe.py in > join(self, other, on, how) > 579 on = on[0] > 580 if how is None: > -->
[jira] [Commented] (SPARK-12180) DataFrame.join() in PySpark gives misleading exception when column name exists on both side
[ https://issues.apache.org/jira/browse/SPARK-12180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902389#comment-15902389 ] Abhishek Kumar commented on SPARK-12180: Is there any concrete solution or reason explaining the issue ? I have facing this in Spark 1.6.2 when I try to join two data-frames derived from same source data-frame. > DataFrame.join() in PySpark gives misleading exception when column name > exists on both side > --- > > Key: SPARK-12180 > URL: https://issues.apache.org/jira/browse/SPARK-12180 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Daniel Thomas > > When joining two DataFrames on a column 'session_uuid' I got the following > exception, because both DataFrames hat a column called 'at'. The exception is > misleading in the cause and in the column causing the problem. Renaming the > column fixed the exception. > --- > Py4JJavaError Traceback (most recent call last) > /Applications/spark-1.5.2-bin-hadoop2.4/python/pyspark/sql/utils.py in > deco(*a, **kw) > 35 try: > ---> 36 return f(*a, **kw) > 37 except py4j.protocol.Py4JJavaError as e: > /Applications/spark-1.5.2-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py > in get_return_value(answer, gateway_client, target_id, name) > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > Py4JJavaError: An error occurred while calling o484.join. 
> : org.apache.spark.sql.AnalysisException: resolved attribute(s) > session_uuid#3278 missing from > uuid_x#9078,total_session_sec#9115L,at#3248,session_uuid#9114,uuid#9117,at#9084 > in operator !Join Inner, Some((uuid_x#9078 = session_uuid#3278)); > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > During 
handling of the above exception, another exception occurred: > AnalysisException Traceback (most recent call last) > in () > 1 sel_starts = starts.select('uuid', 'at').withColumnRenamed('uuid', > 'uuid_x')#.withColumnRenamed('at', 'at_x') > 2 sel_closes = closes.select('uuid', 'at', 'session_uuid', > 'total_session_sec') > > 3 start_close = sel_starts.join(sel_closes, sel_starts['uuid_x'] == > sel_closes['session_uuid']) > 4 start_close.cache() > 5 start_close.take(1) > /Applications/spark-1.5.2-bin-hadoop2.4/python/pyspark/sql/dataframe.py in > join(self, other, on, how) > 579 on = on[0] > 580 if how is None: > --> 581 jdf = self._jdf.join(other._jdf, on._jc, "inner") > 582 else: > 583 assert isinstance(how, basestring), "how should be > basestring" >
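The "resolved attribute(s) session_uuid#3278 missing" error above stems from how Catalyst resolves column references: by a unique expression id, not by name. A Column object captured from one derivation of a DataFrame carries an id that a re-derived plan no longer produces. Below is a deliberately simplified toy model in plain Python (the names and data structures are invented for illustration; this is not Spark's actual code):

```python
import itertools

_next_id = itertools.count(1)

def attr(name):
    # every (re)derivation of a column mints a fresh expression id, like name#3278
    return (name, next(_next_id))

def check_join(left_out, right_out, cond_attr):
    # a join condition only resolves if its exact (name, id) pair is visible
    visible = set(left_out) | set(right_out)
    if cond_attr not in visible:
        raise ValueError(
            "resolved attribute(s) %s#%d missing from %s"
            % (cond_attr[0], cond_attr[1],
               ",".join("%s#%d" % a for a in sorted(visible, key=lambda a: a[1]))))

# the attribute the user captured from an earlier lineage of the DataFrame
old_session_uuid = attr("session_uuid")

# the DataFrames actually being joined re-derive the columns, minting new ids
sel_closes_out = [attr("session_uuid"), attr("total_session_sec")]
sel_starts_out = [attr("uuid_x")]

try:
    check_join(sel_starts_out, sel_closes_out, old_session_uuid)
    err = None
except ValueError as exc:
    err = str(exc)
# err lists session_uuid twice: once with the stale id (missing) and once with
# the fresh id (present), which is exactly what makes the real message confusing
```

The practical fixes are the ones the reporter found: rename the clashing column with `withColumnRenamed`, or build the join condition only from columns taken off the exact DataFrames being joined.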
[jira] [Commented] (SPARK-19859) The new watermark should override the old one
[ https://issues.apache.org/jira/browse/SPARK-19859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902368#comment-15902368 ] Apache Spark commented on SPARK-19859: -- User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/17221 > The new watermark should override the old one > - > > Key: SPARK-19859 > URL: https://issues.apache.org/jira/browse/SPARK-19859 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.1.1, 2.2.0 > > > The new watermark should override the old one. Otherwise, we would just pick up the > first column that has a watermark, which may be unexpected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19877) Restrict the depth of view reference chains
Jiang Xingbo created SPARK-19877: Summary: Restrict the depth of view reference chains Key: SPARK-19877 URL: https://issues.apache.org/jira/browse/SPARK-19877 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.2.0 Reporter: Jiang Xingbo We should restrict the depth of view reference chains, to avoid a stack overflow exception during resolution of nested views. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19877) Restrict the depth of view reference chains
[ https://issues.apache.org/jira/browse/SPARK-19877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902324#comment-15902324 ] Jiang Xingbo commented on SPARK-19877: -- I'm working on this. > Restrict the depth of view reference chains > --- > > Key: SPARK-19877 > URL: https://issues.apache.org/jira/browse/SPARK-19877 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jiang Xingbo > > We should restrict the depth of view reference chains, to avoid a stack > overflow exception during resolution of nested views. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
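The restriction being proposed can be sketched as a depth counter threaded through view resolution. The class, function, and limit below are hypothetical stand-ins, not Spark's actual implementation (in particular, the cap of 100 is an assumption for illustration; the ticket does not fix a number):

```python
class View:
    def __init__(self, name, refers_to=None):
        self.name = name
        self.refers_to = refers_to  # another View, or None for a base table

MAX_NESTED_VIEW_DEPTH = 100  # hypothetical limit

def resolve(view, depth=0):
    """Follow the reference chain, failing fast instead of overflowing the stack."""
    if depth > MAX_NESTED_VIEW_DEPTH:
        raise RecursionError(
            "view %s exceeds the maximum view resolution depth (%d)"
            % (view.name, MAX_NESTED_VIEW_DEPTH))
    if view.refers_to is None:
        return view.name
    return resolve(view.refers_to, depth + 1)

# a shallow chain resolves normally
shallow = View("v", View("base"))

# a chain deeper than the limit is rejected with a clear error,
# rather than blowing the JVM/interpreter stack during analysis
chain = View("base")
for i in range(MAX_NESTED_VIEW_DEPTH + 2):
    chain = View("v%d" % i, chain)
try:
    resolve(chain)
    ok = True
except RecursionError:
    ok = False
```

The design point is that the guard turns an unbounded recursion into a bounded one with an actionable message naming the offending view.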
[jira] [Commented] (SPARK-19808) About the default blocking arg in unpersist
[ https://issues.apache.org/jira/browse/SPARK-19808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902317#comment-15902317 ] zhengruifeng commented on SPARK-19808: -- [~srowen] Agreed. Changing the default may cause latent issues. I will close this jira. > About the default blocking arg in unpersist > --- > > Key: SPARK-19808 > URL: https://issues.apache.org/jira/browse/SPARK-19808 > Project: Spark > Issue Type: Question > Components: ML, Spark Core >Affects Versions: 2.1.0 >Reporter: zhengruifeng >Priority: Minor > > Now, {{unpersist}} is commonly used with its default value in ML. > Most algorithms, like {{KMeans}}, use {{RDD.unpersist}}, and the default > {{blocking}} is {{true}}. > And meta algorithms like {{OneVsRest}} and {{CrossValidator}} use > {{Dataset.unpersist}}, and the default {{blocking}} is {{false}}. > Should the default value for {{RDD.unpersist}} and {{Dataset.unpersist}} be > consistent? > And should all the {{blocking}} args in ML be set to {{false}}? > [~srowen] [~mlnick] [~yanboliang] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-19808) About the default blocking arg in unpersist
[ https://issues.apache.org/jira/browse/SPARK-19808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng closed SPARK-19808. Resolution: Not A Problem > About the default blocking arg in unpersist > --- > > Key: SPARK-19808 > URL: https://issues.apache.org/jira/browse/SPARK-19808 > Project: Spark > Issue Type: Question > Components: ML, Spark Core >Affects Versions: 2.1.0 >Reporter: zhengruifeng >Priority: Minor > > Now, {{unpersist}} is commonly used with its default value in ML. > Most algorithms, like {{KMeans}}, use {{RDD.unpersist}}, and the default > {{blocking}} is {{true}}. > And meta algorithms like {{OneVsRest}} and {{CrossValidator}} use > {{Dataset.unpersist}}, and the default > {{blocking}} is {{false}}. > Should the default value for {{RDD.unpersist}} and {{Dataset.unpersist}} be > consistent? > And should all the {{blocking}} args in ML be set to {{false}}? > [~srowen] [~mlnick] [~yanboliang] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function
[ https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901132#comment-15901132 ] chenerlu edited comment on SPARK-16283 at 3/9/17 1:55 AM: -- Hi, I am a little confused about percentile_approx: is it different from Hive's now? Will we get a different result when the input is the same? For example, I run select percentile_approx(c4_double,array(0.1,0.2,0.3,0.4)) from test; and get different results. c4_double is shown below: 1.0001 2.0001 3.0001 4.0001 5.0001 6.0001 7.0001 8.0001 9.0001 NULL -8.952 -96.0 Hive: [-87.2952,-6.9615799,1.30009998,2.40010003] spark 2.x: [-8.952,1.0001,2.0001,3.0001] So which result is right? Could you please reply when you are free. [~rxin] [~lwlin] was (Author: erlu): Hi, I am a little confused about percentile_approx: is it different from Hive's now? Will we get a different result when the input is the same? For example, I run select percentile_approx(c4_double,array(0.1,0.2,0.3,0.4)) from test; and get different results. c4_double is shown below: 1.0001 2.0001 3.0001 4.0001 5.0001 6.0001 7.0001 8.0001 9.0001 NULL -8.952 -96.0 Hive: [-87.2952,-6.9615799,1.30009998,2.40010003] spark 2.x: [-8.952,1.0001,2.0001,3.0001] So which result is right? Could you please reply when you are free. [~rxin] [~linwei] > Implement percentile_approx SQL function > > > Key: SPARK-16283 > URL: https://issues.apache.org/jira/browse/SPARK-16283 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Sean Zhong > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function
[ https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901132#comment-15901132 ] chenerlu edited comment on SPARK-16283 at 3/9/17 1:55 AM: -- Hi, I am a little confused about percentile_approx: is it different from Hive's now? Will we get a different result when the input is the same? For example, I run select percentile_approx(c4_double,array(0.1,0.2,0.3,0.4)) from test; and get different results. c4_double is shown below: 1.0001 2.0001 3.0001 4.0001 5.0001 6.0001 7.0001 8.0001 9.0001 NULL -8.952 -96.0 Hive: [-87.2952,-6.9615799,1.30009998,2.40010003] spark 2.x: [-8.952,1.0001,2.0001,3.0001] So which result is right? Could you please reply when you are free. [~rxin] [~linwei] was (Author: erlu): Hi, I am a little confused about percentile_approx: is it different from Hive's now? Will we get a different result when the input is the same? For example, I run select percentile_approx(c4_double,array(0.1,0.2,0.3,0.4)) from test; and get different results. c4_double is shown below: 1.0001 2.0001 3.0001 4.0001 5.0001 6.0001 7.0001 8.0001 9.0001 NULL -8.952 -96.0 Hive: [-87.2952,-6.9615799,1.30009998,2.40010003] spark 2.x: [-8.952,1.0001,2.0001,3.0001] So which result is right? Could you please reply when you are free. > Implement percentile_approx SQL function > > > Key: SPARK-16283 > URL: https://issues.apache.org/jira/browse/SPARK-16283 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Sean Zhong > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
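For anyone hitting the same confusion: both engines document percentile_approx as an approximation, but they use different algorithms. Hive builds a histogram and interpolates between bucket boundaries, which is why its answers are not values from the input, while Spark 2.x uses a quantile-summary sketch that returns actual input values (and is exact at the default accuracy for an input this small). Spark's quoted output on the 11 non-NULL values can be reproduced with a simple nearest-rank rule (an illustration only, not Spark's actual implementation):

```python
import math

# the 11 non-NULL values of c4_double from the comment above
data = [1.0001, 2.0001, 3.0001, 4.0001, 5.0001, 6.0001, 7.0001,
        8.0001, 9.0001, -8.952, -96.0]

def nearest_rank(values, p):
    """Return the smallest input element whose rank covers fraction p."""
    ordered = sorted(values)
    rank = max(int(math.ceil(p * len(ordered))), 1)
    return ordered[rank - 1]

result = [nearest_rank(data, p) for p in (0.1, 0.2, 0.3, 0.4)]
```

So neither answer is strictly wrong; they are different approximations of the same quantiles, and comparing them element by element will routinely disagree.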
[jira] [Commented] (SPARK-19862) In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted.
[ https://issues.apache.org/jira/browse/SPARK-19862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902263#comment-15902263 ] guoxiaolong commented on SPARK-19862: - Remove tungsten-sort: because it no longer represents 'org.apache.spark.shuffle.unsafe.UnsafeShuffleManager', it should be deleted. > In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted. > - > > Key: SPARK-19862 > URL: https://issues.apache.org/jira/browse/SPARK-19862 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: guoxiaolong >Priority: Trivial > > "tungsten-sort" -> > classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName can be > deleted, because it is the same as "sort" -> > classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19507) pyspark.sql.types._verify_type() exceptions too broad to debug collections or nested data
[ https://issues.apache.org/jira/browse/SPARK-19507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19507. -- Resolution: Duplicate Actually, it seems a duplicate of SPARK-19871. Let me resolve this one as that one already has a PR. Please reopen this if I am mistaken. > pyspark.sql.types._verify_type() exceptions too broad to debug collections or > nested data > - > > Key: SPARK-19507 > URL: https://issues.apache.org/jira/browse/SPARK-19507 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.1.0 > Environment: macOS Sierra 10.12.3 > Spark 2.1.0, installed via Homebrew >Reporter: David Gingrich >Priority: Trivial > > The private function pyspark.sql.types._verify_type() recursively checks an > object against a datatype, raising an exception if the object does not > satisfy the type. These messages are not specific enough to debug a data > error in a collection or nested data, for instance: > {quote} > >>> import pyspark.sql.types as typ > >>> schema = typ.StructType([typ.StructField('nest1', > >>> typ.MapType(typ.StringType(), typ.ArrayType(typ.FloatType(]) > >>> typ._verify_type({'nest1': {'nest2': [1]}}, schema) > Traceback (most recent call last): > File "", line 1, in > File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1355, in > _verify_type > _verify_type(obj.get(f.name), f.dataType, f.nullable, name=new_name) > File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1349, in > _verify_type > _verify_type(v, dataType.valueType, dataType.valueContainsNull, > name=new_name) > File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1342, in > _verify_type > _verify_type(i, dataType.elementType, dataType.containsNull, > name=new_name) > File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1325, in > _verify_type > % (name, dataType, obj, type(obj))) > TypeError: FloatType can not accept object 1 in type > {quote} > Passing and 
printing a field name would make debugging easier. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
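The improvement the reporter asks for, threading the field path into the error message, can be sketched as follows. The dtype encoding and function names here are invented for illustration and are not PySpark's actual `_verify_type` API:

```python
def verify(obj, dtype, name="obj"):
    """Recursively check obj against dtype, threading a path through `name`."""
    kind = dtype[0]
    if kind == "float":
        if not isinstance(obj, float):
            raise TypeError("field %s: FloatType can not accept object %r in type %s"
                            % (name, obj, type(obj)))
    elif kind == "array":
        for i, item in enumerate(obj):
            verify(item, dtype[1], "%s[%d]" % (name, i))
    elif kind == "map":
        for key, value in obj.items():
            verify(value, dtype[1], "%s[%r]" % (name, key))
    elif kind == "struct":
        for field, ftype in dtype[1].items():
            verify(obj.get(field), ftype, "%s.%s" % (name, field))

# the schema and value from the report: struct -> map -> array -> float
schema = ("struct", {"nest1": ("map", ("array", ("float",)))})
try:
    verify({"nest1": {"nest2": [1]}}, schema)
    msg = None
except TypeError as exc:
    msg = str(exc)
# msg now pinpoints the offending element, e.g. obj.nest1['nest2'][0]
```

With the path in the message, the reader can tell at a glance which nested element failed, instead of only knowing that some FloatType check failed somewhere in the structure.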
[jira] [Assigned] (SPARK-19876) Add OneTime trigger executor
[ https://issues.apache.org/jira/browse/SPARK-19876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19876: Assignee: (was: Apache Spark) > Add OneTime trigger executor > > > Key: SPARK-19876 > URL: https://issues.apache.org/jira/browse/SPARK-19876 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Tyson Condie > Fix For: 2.2.0 > > > The goal is to add a new trigger executor that will process a single trigger > then stop. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19876) Add OneTime trigger executor
[ https://issues.apache.org/jira/browse/SPARK-19876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902221#comment-15902221 ] Apache Spark commented on SPARK-19876: -- User 'tcondie' has created a pull request for this issue: https://github.com/apache/spark/pull/17219 > Add OneTime trigger executor > > > Key: SPARK-19876 > URL: https://issues.apache.org/jira/browse/SPARK-19876 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Tyson Condie > Fix For: 2.2.0 > > > The goal is to add a new trigger executor that will process a single trigger > then stop. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19876) Add OneTime trigger executor
[ https://issues.apache.org/jira/browse/SPARK-19876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19876: Assignee: Apache Spark > Add OneTime trigger executor > > > Key: SPARK-19876 > URL: https://issues.apache.org/jira/browse/SPARK-19876 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Tyson Condie >Assignee: Apache Spark > Fix For: 2.2.0 > > > The goal is to add a new trigger executor that will process a single trigger > then stop. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition
[ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Bruggeman updated SPARK-19872: Priority: Blocker (was: Major) > UnicodeDecodeError in Pyspark on sc.textFile read with repartition > -- > > Key: SPARK-19872 > URL: https://issues.apache.org/jira/browse/SPARK-19872 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 > Environment: Mac and EC2 >Reporter: Brian Bruggeman >Priority: Blocker > > I'm receiving the following traceback: > {code} > >>> sc.textFile('test.txt').repartition(10).collect() > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 810, in collect > return list(_load_from_socket(port, self._jrdd_deserializer)) > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 140, in _load_from_socket > for item in serializer.load_stream(rf): > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 539, in load_stream > yield self.loads(stream) > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 534, in loads > return s.decode("utf-8") if self.use_unicode else s > File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", > line 16, in decode > return codecs.utf_8_decode(input, errors, True) > UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: > invalid start byte > {code} > I created a text file (test.txt) with standard Linux newlines: > {code} > a > b > d > e > f > g > h > i > j > k > l > {code} > I then ran pyspark: > {code} > $ pyspark > Python 2.7.13 (default, Dec 18 2016, 07:03:39) > [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin > Type "help", "copyright", "credits" or "license" for more information. 
> Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.1.0 > /_/ > Using Python version 2.7.13 (default, Dec 18 2016 07:03:39) > SparkSession available as 'spark'. > >>> sc.textFile('test.txt').collect() > [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l'] > >>> sc.textFile('test.txt', use_unicode=False).collect() > ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l'] > >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect() > ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', > '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.'] > >>> sc.textFile('test.txt').repartition(10).collect() > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 810, in collect > return list(_load_from_socket(port, self._jrdd_deserializer)) > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 140, in _load_from_socket > for item in serializer.load_stream(rf): > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 539, in load_stream > yield self.loads(stream) > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 534, in loads > return s.decode("utf-8") if self.use_unicode else s > File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", > line 16, in decode > return codecs.utf_8_decode(input, errors, True) > 
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: > invalid start byte > {code} > This really looks like a bug in the `serializers.py` code. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
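The bytes shown in the use_unicode=False output ('\x80\x02]q\x01...') are a pickled batch, not text: repartition moves the data through PySpark's batched pickle serializer, and the collect path then tries to UTF-8-decode those pickle frames, which is exactly what fails. The mismatch can be reproduced without Spark (a sketch of the failure mode, assuming the quoted bytes are pickle protocol 2, which their \x80\x02 header indicates):

```python
import pickle

# a pickled list of short strings, like one output partition after repartition()
payload = pickle.dumps(["a", "b", "c", "d", "e", "f", "g"], protocol=2)
assert payload.startswith(b"\x80\x02")  # the header seen in the bug report

# the text deserializer then effectively does s.decode("utf-8") on these bytes
try:
    payload.decode("utf-8")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    reason = exc.reason  # 'invalid start byte', at position 0, matching the report
```

This supports the reporter's reading: the data is fine, but the deserializer chosen for the collected stream does not match the serializer that produced it.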
[jira] [Created] (SPARK-19876) Add OneTime trigger executor
Tyson Condie created SPARK-19876: Summary: Add OneTime trigger executor Key: SPARK-19876 URL: https://issues.apache.org/jira/browse/SPARK-19876 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Tyson Condie Fix For: 2.2.0 The goal is to add a new trigger executor that will process a single trigger then stop. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19281) spark.ml Python API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-19281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19281: Assignee: Apache Spark > spark.ml Python API for FPGrowth > > > Key: SPARK-19281 > URL: https://issues.apache.org/jira/browse/SPARK-19281 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > See parent issue. This is for a Python API *after* the Scala API has been > designed and implemented. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19281) spark.ml Python API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-19281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19281: Assignee: (was: Apache Spark) > spark.ml Python API for FPGrowth > > > Key: SPARK-19281 > URL: https://issues.apache.org/jira/browse/SPARK-19281 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley > > See parent issue. This is for a Python API *after* the Scala API has been > designed and implemented. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19281) spark.ml Python API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-19281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902198#comment-15902198 ] Apache Spark commented on SPARK-19281: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/17218 > spark.ml Python API for FPGrowth > > > Key: SPARK-19281 > URL: https://issues.apache.org/jira/browse/SPARK-19281 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley > > See parent issue. This is for a Python API *after* the Scala API has been > designed and implemented. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19875) Map->filter on many columns gets stuck in constraint inference optimization code
[ https://issues.apache.org/jira/browse/SPARK-19875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Pranavamurthi updated SPARK-19875: -- Description: The attached code (TestFilter.scala) works with a 10-column csv dataset, but gets stuck with a 50-column csv dataset. Both datasets are attached. was: The attached code works with a 10-column csv dataset, but gets stuck with a 50-column csv dataset. Both datasets are attached. > Map->filter on many columns gets stuck in constraint inference optimization > code > > > Key: SPARK-19875 > URL: https://issues.apache.org/jira/browse/SPARK-19875 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jay Pranavamurthi > Attachments: test10cols.csv, test50cols.csv, TestFilter.scala > > > The attached code (TestFilter.scala) works with a 10-column csv dataset, but > gets stuck with a 50-column csv dataset. Both datasets are attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19875) Map->filter on many columns gets stuck in constraint inference optimization code
[ https://issues.apache.org/jira/browse/SPARK-19875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Pranavamurthi updated SPARK-19875: -- Attachment: TestFilter.scala test50cols.csv test10cols.csv > Map->filter on many columns gets stuck in constraint inference optimization > code > > > Key: SPARK-19875 > URL: https://issues.apache.org/jira/browse/SPARK-19875 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jay Pranavamurthi > Attachments: test10cols.csv, test50cols.csv, TestFilter.scala > > > The attached code works with a 10-column csv dataset, but gets stuck with a > 50-column csv dataset. Both datasets are attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19875) Map->filter on many columns gets stuck in constraint inference optimization code
Jay Pranavamurthi created SPARK-19875: - Summary: Map->filter on many columns gets stuck in constraint inference optimization code Key: SPARK-19875 URL: https://issues.apache.org/jira/browse/SPARK-19875 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Jay Pranavamurthi The attached code works with a 10-column csv dataset, but gets stuck with a 50-column csv dataset. Both datasets are attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
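A plausible way to see why many columns make this pathological: when the optimizer infers additional constraints, each attribute in a predicate that has an equivalent alias multiplies the number of candidate rewrites, so the search space can grow exponentially with column count. A rough counting model (pure Python with hypothetical names; this is not Spark's actual optimizer code, and the real mechanism in constraint inference is more involved):

```python
def num_candidate_constraints(predicate_attrs, aliases):
    """Count rewrites of one predicate when each attribute may be replaced
    by any of its equivalent aliases (the attribute itself included)."""
    total = 1
    for attr in predicate_attrs:
        total *= len(aliases.get(attr, [attr]))
    return total

# 10 columns, each mapped through one alias: 2**10 rewrites -- tolerable
cols10 = ["c%d" % i for i in range(10)]
aliases10 = {c: [c, c + "_mapped"] for c in cols10}

# 50 columns: 2**50 rewrites -- the regime where the query appears to hang
cols50 = ["c%d" % i for i in range(50)]
aliases50 = {c: [c, c + "_mapped"] for c in cols50}
```

Under this model, going from 10 to 50 columns does not make optimization 5x slower but roughly 2**40 times larger, which is consistent with "works at 10 columns, stuck at 50".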
[jira] [Commented] (SPARK-6936) SQLContext.sql() caused deadlock in multi-thread env
[ https://issues.apache.org/jira/browse/SPARK-6936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902167#comment-15902167 ] Henry Min commented on SPARK-6936: -- This issue seems to have been fixed in version 1.5.0. The information here is extremely important. Can anyone provide the merge details and show me the merged source code? Because a similar issue still exists on the latest version. > SQLContext.sql() caused deadlock in multi-thread env > > > Key: SPARK-6936 > URL: https://issues.apache.org/jira/browse/SPARK-6936 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: JDK 1.8.x, RedHat > Linux version 2.6.32-431.23.3.el6.x86_64 > (mockbu...@x86-027.build.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red > Hat 4.4.7-4) (GCC) ) #1 SMP Wed Jul 16 06:12:23 EDT 2014 >Reporter: Paul Wu >Assignee: Michael Armbrust > Labels: deadlock, sql, threading > Fix For: 1.5.0 > > > Doing (the same query) in more than one thread with SQLContext.sql may lead > to deadlock. Here is a way to reproduce it (since this is a multi-thread issue, > the reproduction may or may not be so easy). > 1. Register a relatively big table. > 2. Create two different classes and, in the classes, do the same query in a > method and put the results in a set and print out the set size. > 3. Create two threads to use an object from each class in the run method. > Start the threads. For my tests, it can have a deadlock just in a few runs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19540) Add ability to clone SparkSession with an identical copy of the SessionState
[ https://issues.apache.org/jira/browse/SPARK-19540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-19540: Assignee: Kunal Khamar > Add ability to clone SparkSession with an identical copy of the SessionState > > > Key: SPARK-19540 > URL: https://issues.apache.org/jira/browse/SPARK-19540 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kunal Khamar >Assignee: Kunal Khamar > Fix For: 2.2.0 > > > Forking a newSession() from SparkSession currently makes a new SparkSession > that does not retain SessionState (i.e. temporary tables, SQL config, > registered functions etc.) This change adds a method cloneSession() which > creates a new SparkSession with a copy of the parent's SessionState. > Subsequent changes to base session are not propagated to cloned session, > clone is independent after creation. > If the base is changed after clone has been created, say user registers new > UDF, then the new UDF will not be available inside the clone. Same goes for > configs and temp tables. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19540) Add ability to clone SparkSession with an identical copy of the SessionState
[ https://issues.apache.org/jira/browse/SPARK-19540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19540. -- Resolution: Fixed Fix Version/s: 2.2.0 > Add ability to clone SparkSession with an identical copy of the SessionState > > > Key: SPARK-19540 > URL: https://issues.apache.org/jira/browse/SPARK-19540 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kunal Khamar > Fix For: 2.2.0 > > > Forking a newSession() from SparkSession currently makes a new SparkSession > that does not retain SessionState (i.e. temporary tables, SQL config, > registered functions etc.) This change adds a method cloneSession() which > creates a new SparkSession with a copy of the parent's SessionState. > Subsequent changes to base session are not propagated to cloned session, > clone is independent after creation. > If the base is changed after clone has been created, say user registers new > UDF, then the new UDF will not be available inside the clone. Same goes for > configs and temp tables. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19874) Hide API docs for "org.apache.spark.sql.internal"
[ https://issues.apache.org/jira/browse/SPARK-19874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902161#comment-15902161 ] Apache Spark commented on SPARK-19874: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/17217 > Hide API docs for "org.apache.spark.sql.internal" > Key: SPARK-19874 > URL: https://issues.apache.org/jira/browse/SPARK-19874 > Project: Spark > Issue Type: Bug > Components: Build > Affects Versions: 2.1.0 > Reporter: Shixiong Zhu > Assignee: Shixiong Zhu > The API docs should not include the "org.apache.spark.sql.internal" package because they are internal private APIs.
[jira] [Assigned] (SPARK-19874) Hide API docs for "org.apache.spark.sql.internal"
[ https://issues.apache.org/jira/browse/SPARK-19874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19874: Assignee: Apache Spark (was: Shixiong Zhu)
[jira] [Assigned] (SPARK-19874) Hide API docs for "org.apache.spark.sql.internal"
[ https://issues.apache.org/jira/browse/SPARK-19874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19874: Assignee: Shixiong Zhu (was: Apache Spark)
[jira] [Created] (SPARK-19874) Hide API docs for "org.apache.spark.sql.internal"
Shixiong Zhu created SPARK-19874: Summary: Hide API docs for "org.apache.spark.sql.internal" Key: SPARK-19874 URL: https://issues.apache.org/jira/browse/SPARK-19874 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.1.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu The API docs should not include the "org.apache.spark.sql.internal" package because they are internal private APIs.
[jira] [Assigned] (SPARK-19873) If the user changes the number of shuffle partitions between batches, Streaming aggregation will fail.
[ https://issues.apache.org/jira/browse/SPARK-19873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19873: Assignee: Apache Spark > If the user changes the number of shuffle partitions between batches, Streaming aggregation will fail. > Key: SPARK-19873 > URL: https://issues.apache.org/jira/browse/SPARK-19873 > Project: Spark > Issue Type: Bug > Components: Structured Streaming > Affects Versions: 2.1.0 > Reporter: Kunal Khamar > Assignee: Apache Spark > If the user changes the shuffle partition number between batches, Streaming aggregation will fail. Here are some possible cases: > - Change "spark.sql.shuffle.partitions" > - Use "repartition" and change the partition number in code > - RangePartitioner doesn't generate deterministic partitions. Right now it's safe because we disallow sort before aggregation; not sure if we will add operators using RangePartitioner in the future. > Fix: record the number of shuffle partitions in the offset log and enforce it in the next batch.
[jira] [Commented] (SPARK-19873) If the user changes the number of shuffle partitions between batches, Streaming aggregation will fail.
[ https://issues.apache.org/jira/browse/SPARK-19873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902122#comment-15902122 ] Apache Spark commented on SPARK-19873: -- User 'kunalkhamar' has created a pull request for this issue: https://github.com/apache/spark/pull/17216
[jira] [Assigned] (SPARK-19873) If the user changes the number of shuffle partitions between batches, Streaming aggregation will fail.
[ https://issues.apache.org/jira/browse/SPARK-19873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19873: Assignee: (was: Apache Spark)
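The fix proposed for SPARK-19873 is to record the shuffle partition count in the offset log and enforce it on the next batch, so a changed spark.sql.shuffle.partitions cannot silently remap keyed state. A minimal pure-Python sketch of that idea (all names, including `write_offset_log` and `start_batch`, are hypothetical; this is not Spark's actual checkpoint format):

```python
import json

def write_offset_log(log, batch_id, offsets, num_shuffle_partitions):
    # Persist the partition count alongside the batch's source offsets.
    log[batch_id] = json.dumps(
        {"offsets": offsets, "numShufflePartitions": num_shuffle_partitions})

def start_batch(log, batch_id, current_conf_partitions):
    # On the next batch (or after recovery), prefer the recorded value over
    # the session conf, so state partitioning stays consistent.
    last = json.loads(log[batch_id - 1])
    recorded = last["numShufflePartitions"]
    return recorded if recorded != current_conf_partitions else current_conf_partitions

log = {}
write_offset_log(log, 0, {"source0": 42}, 200)
# User changed spark.sql.shuffle.partitions to 10 between batches:
print(start_batch(log, 1, 10))   # 200 -- the recorded value wins
```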
[jira] [Updated] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases
[ https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19858: - Affects Version/s: (was: 2.1.1) > Add output mode to flatMapGroupsWithState and disallow invalid cases > Key: SPARK-19858 > URL: https://issues.apache.org/jira/browse/SPARK-19858 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming > Affects Versions: 2.2.0 > Reporter: Shixiong Zhu > Assignee: Shixiong Zhu > Fix For: 2.2.0
[jira] [Resolved] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases
[ https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19858. -- Resolution: Fixed Fix Version/s: 2.2.0
[jira] [Updated] (SPARK-19413) Basic mapGroupsWithState API
[ https://issues.apache.org/jira/browse/SPARK-19413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19413: - Fix Version/s: (was: 2.1.1) > Basic mapGroupsWithState API > Key: SPARK-19413 > URL: https://issues.apache.org/jira/browse/SPARK-19413 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming > Reporter: Tathagata Das > Assignee: Tathagata Das > Fix For: 2.2.0 > Basic API (without timeouts) as described in the parent JIRA SPARK-19067
[jira] [Commented] (SPARK-19413) Basic mapGroupsWithState API
[ https://issues.apache.org/jira/browse/SPARK-19413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902105#comment-15902105 ] Shixiong Zhu commented on SPARK-19413: -- Reverted the patch from branch 2.1. This feature will not go into 2.1.1.
[jira] [Updated] (SPARK-19413) Basic mapGroupsWithState API
[ https://issues.apache.org/jira/browse/SPARK-19413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19413: - Target Version/s: 2.2.0 (was: 2.1.1, 2.2.0)
[jira] [Resolved] (SPARK-19813) maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource
[ https://issues.apache.org/jira/browse/SPARK-19813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-19813. - Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 > maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource > Key: SPARK-19813 > URL: https://issues.apache.org/jira/browse/SPARK-19813 > Project: Spark > Issue Type: Bug > Components: Structured Streaming > Affects Versions: 2.1.0 > Reporter: Burak Yavuz > Assignee: Burak Yavuz > Fix For: 2.1.1, 2.2.0 > There is a file stream source option called maxFileAge which limits how old the files can be, relative to the latest file that has been seen. This is used to limit the files that need to be remembered as "processed"; files older than the latest processed files are ignored. This value is 7 days by default. > This causes a problem when both: > - latestFirst = true > - maxFilesPerTrigger > total files to be processed. > Here is what happens in all combinations: > 1) latestFirst = false - Since files are processed in order, there won't be any unprocessed file older than the latest processed file. All files will be processed. > 2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is not set, then all old files get processed in the first batch, and so no file is left behind. > 3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch processes the latest X files. That sets the threshold to (latest file - maxFileAge), so files older than this threshold will never be considered for processing. > The bug is with case 3.
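Case 3 above can be modeled with a toy file source: once the first latestFirst batch has seen the newest files, the (latest file - maxFileAge) threshold permanently excludes everything older. A minimal sketch under assumed, illustrative names (this is not FileStreamSource's real code; timestamps stand in for file ages):

```python
MAX_FILE_AGE = 7  # toy "days"; plays the role of maxFileAge

# 20 files as (name, timestamp), f0 oldest .. f19 newest.
files = [("f%d" % t, t) for t in range(20)]

def next_batch(unprocessed, latest_seen, max_files):
    # Files older than (latest seen - maxFileAge) are ignored entirely.
    threshold = latest_seen - MAX_FILE_AGE
    candidates = [f for f in unprocessed if f[1] >= threshold]
    candidates.sort(key=lambda f: f[1], reverse=True)  # latestFirst = true
    return candidates[:max_files]                      # maxFilesPerTrigger

# First batch grabs the 3 newest files and fixes latest_seen at 19.
batch1 = next_batch(files, latest_seen=19, max_files=3)
remaining = [f for f in files if f not in batch1]

# Second batch: threshold is 19 - 7 = 12, so f0..f11 can never be selected.
batch2 = next_batch(remaining, latest_seen=19, max_files=3)
missed = [f for f in remaining if f[1] < 19 - MAX_FILE_AGE]
print(len(missed))  # 12 old files are left behind forever
```

With latestFirst = false the same loop drains oldest-first, so no file ever falls below the threshold unprocessed, matching case 1 in the description.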
[jira] [Comment Edited] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition
[ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902023#comment-15902023 ] Brian Bruggeman edited comment on SPARK-19872 at 3/8/17 10:33 PM: -- Using the Spark 2.1.0 serializers.py and the Spark 2.0.2 rdd.py, the code runs. was (Author: bbruggeman): Using the Spark 2.1.0 serializers.py and the Spark 2.0.2 rdd.py, the code runs. Therefore there must be a bug in the Spark 2.1.0 rdd.py file. I attempted the naive approach of reverting both {code}def _load_from_socket{code} and {code}def collect{code}, but I didn't see any appreciable difference within the output. > UnicodeDecodeError in Pyspark on sc.textFile read with repartition > -- > > Key: SPARK-19872 > URL: https://issues.apache.org/jira/browse/SPARK-19872 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 > Environment: Mac and EC2 >Reporter: Brian Bruggeman > > I'm receiving the following traceback: > {code} > >>> sc.textFile('test.txt').repartition(10).collect() > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 810, in collect > return list(_load_from_socket(port, self._jrdd_deserializer)) > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 140, in _load_from_socket > for item in serializer.load_stream(rf): > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 539, in load_stream > yield self.loads(stream) > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 534, in loads > return s.decode("utf-8") if self.use_unicode else s > File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", > line 16, in decode > return codecs.utf_8_decode(input, errors, True) > UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: > invalid start byte > {code} > I created a textfile (text.txt) 
with standard linux newlines: > {code} > a > b > c > d > e > f > g > h > i > j > k > l > {code} > I then ran pyspark: > {code} > $ pyspark > Python 2.7.13 (default, Dec 18 2016, 07:03:39) > [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin > Type "help", "copyright", "credits" or "license" for more information. > Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). > 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable > 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException > Welcome to > [Spark ASCII art banner] version 2.1.0 > Using Python version 2.7.13 (default, Dec 18 2016 07:03:39) > SparkSession available as 'spark'.
> >>> sc.textFile('test.txt').collect() > [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l'] > >>> sc.textFile('test.txt', use_unicode=False).collect() > ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l'] > >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect() > ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', > '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.'] > >>> sc.textFile('test.txt').repartition(10).collect() > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 810, in collect > return list(_load_from_socket(port, self._jrdd_deserializer)) > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 140, in _load_from_socket > for item in serializer.load_stream(rf): > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 539, in load_stream > yield self.loads(stream) > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 534, in loads > return s.decode("utf-8") if self.use_unicode else s > File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", > line 16, in decode > return codecs.utf_8_decode(input, errors, True) > UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: > invalid start byte > {code} > This really looks like a bug in the `serializers.py` code.
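The '\x80\x02]q\x01(...' strings in the use_unicode=False output above are pickle protocol-2 frames, which suggests the repartitioned partitions come back pickled while collect() still decodes each frame as UTF-8 text. The exact error in the traceback can be reproduced outside Spark (a minimal sketch; the mapping onto Spark's internals is an inference from the output, not confirmed here):

```python
import pickle

# Pickle a small list with protocol 2, the same framing seen in the report:
# protocol-2 streams always begin with the bytes \x80\x02.
frame = pickle.dumps(["a", "b", "c"], protocol=2)
print(frame[:2])  # b'\x80\x02' -- the "magic" visible in the collect() output

try:
    # What a UTF-8 text deserializer effectively does to a pickled frame:
    frame.decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason, e.start)  # 'invalid start byte' at position 0
```

Byte 0x80 is never a valid first byte of a UTF-8 sequence, which is why the failure is deterministic and always reported at position 0.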
[jira] [Updated] (SPARK-19540) Add ability to clone SparkSession with an identical copy of the SessionState
[ https://issues.apache.org/jira/browse/SPARK-19540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kunal Khamar updated SPARK-19540: - Summary: Add ability to clone SparkSession with an identical copy of the SessionState (was: Add ability to clone SparkSession wherein cloned session has an identical copy of the SessionState)
[jira] [Updated] (SPARK-19540) Add ability to clone SparkSession wherein cloned session has an identical copy of the SessionState
[ https://issues.apache.org/jira/browse/SPARK-19540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kunal Khamar updated SPARK-19540: - Summary: Add ability to clone SparkSession wherein cloned session has an identical copy of the SessionState (was: Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState)
[jira] [Closed] (SPARK-19814) Spark History Server Out Of Memory / Extreme GC
[ https://issues.apache.org/jira/browse/SPARK-19814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon King closed SPARK-19814. -- Resolution: Duplicate Looks like it's wrong to characterize this as a bug -- couldn't identify an actual memory leak. Seems more like we'll have to wait for the major overhaul proposed by https://issues.apache.org/jira/browse/SPARK-18085 > Spark History Server Out Of Memory / Extreme GC > --- > > Key: SPARK-19814 > URL: https://issues.apache.org/jira/browse/SPARK-19814 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.1, 2.0.0, 2.1.0 > Environment: Spark History Server (we've run it on several different > Hadoop distributions) >Reporter: Simon King > Attachments: SparkHistoryCPUandRAM.png > > > Spark History Server runs out of memory, gets into GC thrash and eventually > becomes unresponsive. This seems to happen more quickly with heavy use of the > REST API. We've seen this with several versions of Spark. > Running with the following settings (spark 2.1): > spark.history.fs.cleaner.enabledtrue > spark.history.fs.cleaner.interval 1d > spark.history.fs.cleaner.maxAge 7d > spark.history.retainedApplications 500 > We will eventually get errors like: > 17/02/25 05:02:19 WARN ServletHandler:· > javax.servlet.ServletException: scala.MatchError: java.lang.OutOfMemoryError: > GC overhead limit exceeded (of class java.lang.OutOfMemoryError) > at > org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489) > at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812) > at > 
org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:529) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) > at org.spark_project.jetty.server.Server.handle(Server.java:499) > at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257) > at > org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) > at java.lang.Thread.run(Thread.java:745) > Caused by: scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit > exceeded (of class java.lang.OutOfMemoryError) > at > org.apache.spark.deploy.history.ApplicationCache.getSparkUI(ApplicationCache.scala:148) > at > org.apache.spark.deploy.history.HistoryServer.getSparkUI(HistoryServer.scala:110) > at > org.apache.spark.status.api.v1.UIRoot$class.withSparkUI(ApiRootResource.scala:244) > at > org.apache.spark.deploy.history.HistoryServer.withSparkUI(HistoryServer.scala:49) > at > org.apache.spark.status.api.v1.ApiRootResource.getJobs(ApiRootResource.scala:66) > at sun.reflect.GeneratedMethodAccessor102.invoke(Unknown Source) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter$1.run(SubResourceLocatorRouter.java:158) > at > org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter.getResource(SubResourceLocatorRouter.java:178) > at > org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter.apply(SubResourceLocatorRouter.java:109) > at > org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:109) > at > org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:112) > at > org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:112) > at >
[jira] [Comment Edited] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition
[ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902023#comment-15902023 ] Brian Bruggeman edited comment on SPARK-19872 at 3/8/17 9:48 PM: - Using the Spark 2.1.0 serializers.py and the Spark 2.0.2 rdd.py, the code runs. Therefore there must be a bug in the Spark 2.1.0 rdd.py file. I attempted the naive approach of reverting both {code}def _load_from_socket{code} and {code}def collect{code}, but I didn't see any appreciable difference within the output. was (Author: bbruggeman): Using the Spark 2.1.0 serializers.py and the Spark 2.0.2 rdd.py, the code runs. Therefore there must be a bug in the Spark 2.1.0 rdd.py file.
[jira] [Comment Edited] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition
[ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902023#comment-15902023 ]

Brian Bruggeman edited comment on SPARK-19872 at 3/8/17 9:46 PM:
-
Using the Spark 2.1.0 serializers.py and the Spark 2.0.2 rdd.py, the code runs. Therefore there must be a bug in the Spark 2.1.0 rdd.py file.

was (Author: bbruggeman):
Using the Spark 2.1.0 serializers.py and the Spark 2.0.2 rdd.py, the code runs. Therefore there must be a bug in the rdd.py file somewhere.

> UnicodeDecodeError in Pyspark on sc.textFile read with repartition
> --
>
> Key: SPARK-19872
> URL: https://issues.apache.org/jira/browse/SPARK-19872
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.1.0
> Environment: Mac and EC2
> Reporter: Brian Bruggeman
>
> I'm receiving the following traceback:
> {code}
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 810, in collect
>     return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 140, in _load_from_socket
>     for item in serializer.load_stream(rf):
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 539, in load_stream
>     yield self.loads(stream)
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 534, in loads
>     return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
> {code}
> I created a textfile (test.txt) with standard linux newlines:
> {code}
> a
> b
> c
> d
> e
> f
> g
> h
> i
> j
> k
> l
> {code}
> I then ran pyspark:
> {code}
> $ pyspark
> Python 2.7.13 (default, Dec 18 2016, 07:03:39)
> [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>       /_/
>
> Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
> SparkSession available as 'spark'.
> >>> sc.textFile('test.txt').collect()
> [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
> >>> sc.textFile('test.txt', use_unicode=False).collect()
> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
> >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
> ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 810, in collect
>     return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 140, in _load_from_socket
>     for item in serializer.load_stream(rf):
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 539, in load_stream
>     yield self.loads(stream)
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 534, in loads
>     return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
> {code}
> This really looks like a bug in the `serializers.py` code.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
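The byte strings returned by the `use_unicode=False` run above (`'\x80\x02]q\x01(...'`) are recognizably a pickle protocol-2 stream, which suggests the serializer mismatch at play: after `repartition()`, partition data appears to be written by PySpark's batched pickle serializer but read back by the UTF-8 deserializer. A minimal, Spark-free sketch reproducing the exact decode failure from the traceback:

```python
import pickle

# Serialize a batch of rows with pickle protocol 2 (what PySpark's batched
# serializer produces), then try to decode the raw frame as UTF-8 text
# (what UTF8Deserializer does when use_unicode=True).
frame = pickle.dumps(["a", "b", "c", "d", "e", "f", "g"], protocol=2)

# Protocol-2 pickles start with the PROTO opcode 0x80 -- the exact byte
# reported in the traceback ("can't decode byte 0x80 in position 0").
assert frame[:2] == b"\x80\x02"

try:
    frame.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason, "at position", exc.start)  # -> invalid start byte at position 0
```

This does not pinpoint where in rdd.py the wrong deserializer is chosen, but it confirms the reported `0x80` is consistent with pickled batches being handed to the text decoder.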
[jira] [Commented] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition
[ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902023#comment-15902023 ]

Brian Bruggeman commented on SPARK-19872:
-
Using the Spark 2.1.0 serializers.py and the Spark 2.0.0 rdd.py, the code runs. Therefore there must be a bug in the rdd.py file somewhere.

> UnicodeDecodeError in Pyspark on sc.textFile read with repartition
> --
[jira] [Comment Edited] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition
[ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902023#comment-15902023 ]

Brian Bruggeman edited comment on SPARK-19872 at 3/8/17 9:46 PM:
-
Using the Spark 2.1.0 serializers.py and the Spark 2.0.2 rdd.py, the code runs. Therefore there must be a bug in the rdd.py file somewhere.

was (Author: bbruggeman):
Using the Spark 2.1.0 serializers.py and the Spark 2.0.0 rdd.py, the code runs. Therefore there must be a bug in the rdd.py file somewhere.

> UnicodeDecodeError in Pyspark on sc.textFile read with repartition
> --
[jira] [Assigned] (SPARK-15463) Support for creating a dataframe from CSV in Dataset[String]
[ https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-15463:
---
Assignee: Hyukjin Kwon

> Support for creating a dataframe from CSV in Dataset[String]
> --
>
> Key: SPARK-15463
> URL: https://issues.apache.org/jira/browse/SPARK-15463
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: PJ Fanning
> Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
> I currently use Databricks' spark-csv lib but some features don't work with Apache Spark 2.0.0-SNAPSHOT. I understand that with the addition of CSV support into spark-sql directly, spark-csv won't be modified.
> I currently read some CSV data that has been pre-processed and is in RDD[String] format.
> There is sqlContext.read.json(rdd: RDD[String]), but other formats don't appear to support the creation of DataFrames based on loading from RDD[String].
[jira] [Resolved] (SPARK-15463) Support for creating a dataframe from CSV in Dataset[String]
[ https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-15463.
-
Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16854
[https://github.com/apache/spark/pull/16854]

> Support for creating a dataframe from CSV in Dataset[String]
> --
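The shape of the requested feature — parsing CSV rows that already live in memory as strings, rather than pointing the reader at file paths — can be illustrated without Spark using the standard library. This is only an analogy for the `spark.read.csv` on `Dataset[String]` support the ticket resolves; the names below are illustrative, not Spark's API:

```python
import csv

# Pre-processed CSV lines already held in memory (the stand-in for a
# Dataset[String]/RDD[String] of CSV rows), parsed into tabular records
# without ever touching the filesystem.
lines = ["id,name", "1,alice", "2,bob"]
header, *rows = csv.reader(lines)
records = [dict(zip(header, row)) for row in rows]
print(records)  # -> [{'id': '1', 'name': 'alice'}, {'id': '2', 'name': 'bob'}]
```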
[jira] [Updated] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases
[ https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19858: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-19067 > Add output mode to flatMapGroupsWithState and disallow invalid cases > > > Key: SPARK-19858 > URL: https://issues.apache.org/jira/browse/SPARK-19858 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.1.1, 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases
[ https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-19858:
-
Affects Version/s: 2.1.1

> Add output mode to flatMapGroupsWithState and disallow invalid cases
> --
[jira] [Commented] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition
[ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901997#comment-15901997 ]

Brian Bruggeman commented on SPARK-19872:
-
I reverted `rdd.py` and `serializers.py` to the 2.0.2 branch in github and the code above works without an error.

rdd.py link: https://github.com/apache/spark/blob/branch-2.0/python/pyspark/rdd.py
serializers.py link: https://github.com/apache/spark/blob/branch-2.0/python/pyspark/serializers.py

rdd diff:
{code}
--- tmp2.py	2017-03-08 15:17:45.0 -0600
+++ saved2.py	2017-03-08 15:17:59.0 -0600
@@ -52,8 +52,6 @@
     get_used_memory, ExternalSorter, ExternalGroupBy
 from pyspark.traceback_utils import SCCallSiteSync
 
-from py4j.java_collections import ListConverter, MapConverter
-
 __all__ = ["RDD"]
 
@@ -137,11 +135,12 @@
             break
     if not sock:
         raise Exception("could not open socket")
-    # The RDD materialization time is unpredicable, if we set a timeout for socket reading
-    # operation, it will very possibly fail. See SPARK-18281.
-    sock.settimeout(None)
-    # The socket will be automatically closed when garbage-collected.
-    return serializer.load_stream(sock.makefile("rb", 65536))
+    try:
+        rf = sock.makefile("rb", 65536)
+        for item in serializer.load_stream(rf):
+            yield item
+    finally:
+        sock.close()
 
 
 def ignore_unicode_prefix(f):
@@ -264,13 +263,44 @@
 
     def isCheckpointed(self):
         """
-        Return whether this RDD has been checkpointed or not
+        Return whether this RDD is checkpointed and materialized, either reliably or locally.
         """
         return self._jrdd.rdd().isCheckpointed()
 
+    def localCheckpoint(self):
+        """
+        Mark this RDD for local checkpointing using Spark's existing caching layer.
+
+        This method is for users who wish to truncate RDD lineages while skipping the expensive
+        step of replicating the materialized data in a reliable distributed file system. This is
+        useful for RDDs with long lineages that need to be truncated periodically (e.g. GraphX).
+
+        Local checkpointing sacrifices fault-tolerance for performance. In particular, checkpointed
+        data is written to ephemeral local storage in the executors instead of to a reliable,
+        fault-tolerant storage. The effect is that if an executor fails during the computation,
+        the checkpointed data may no longer be accessible, causing an irrecoverable job failure.
+
+        This is NOT safe to use with dynamic allocation, which removes executors along
+        with their cached blocks. If you must use both features, you are advised to set
+        L{spark.dynamicAllocation.cachedExecutorIdleTimeout} to a high value.
+
+        The checkpoint directory set through L{SparkContext.setCheckpointDir()} is not used.
+        """
+        self._jrdd.rdd().localCheckpoint()
+
+    def isLocallyCheckpointed(self):
+        """
+        Return whether this RDD is marked for local checkpointing.
+
+        Exposed for testing.
+        """
+        return self._jrdd.rdd().isLocallyCheckpointed()
+
     def getCheckpointFile(self):
         """
         Gets the name of the file to which this RDD was checkpointed
+
+        Not defined if RDD is checkpointed locally.
         """
         checkpointFile = self._jrdd.rdd().getCheckpointFile()
         if checkpointFile.isDefined():
@@ -387,6 +417,9 @@
         with replacement: expected number of times each element is chosen; fraction must be >= 0
         :param seed: seed for the random number generator
 
+        .. note:: This is not guaranteed to provide exactly the fraction specified of the total
+            count of the given :class:`DataFrame`.
+
         >>> rdd = sc.parallelize(range(100), 4)
         >>> 6 <= rdd.sample(False, 0.1, 81).count() <= 14
         True
@@ -425,8 +458,8 @@
         """
         Return a fixed-size sampled subset of this RDD.
 
-        Note that this method should only be used if the resulting array is expected
-        to be small, as all the data is loaded into the driver's memory.
+        .. note:: This method should only be used if the resulting array is expected
+            to be small, as all the data is loaded into the driver's memory.
 
         >>> rdd = sc.parallelize(range(0, 10))
         >>> len(rdd.takeSample(True, 20, 1))
@@ -537,7 +570,7 @@
         Return the intersection of this RDD and another one. The output will not
         contain any duplicate elements, even if the input RDDs did.
 
-        Note that this method performs a shuffle internally.
+        .. note:: This method performs a shuffle internally.
 
         >>> rdd1 = sc.parallelize([1, 10, 2, 3, 4, 5])
         >>> rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])
@@ -768,8 +801,9 @@
     def collect(self):
         """
         Return a list that
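The key behavioral difference in the `_load_from_socket` hunk above: the 2.0.x body contains `yield`, so the function is a generator that closes the socket deterministically in `finally` once iteration finishes, whereas the 2.1.0 body returns `serializer.load_stream(...)` directly and leaves the socket to be closed by garbage collection. A simplified, Spark-free sketch of the 2.0.x pattern, with a `BytesIO` standing in for the socket file:

```python
import io

def load_from_stream(raw, deserialize):
    # Generator in the style of the 2.0.x _load_from_socket: nothing runs
    # until iteration starts, and the underlying stream is closed
    # deterministically in `finally` when iteration ends (or the consumer
    # abandons the generator).
    try:
        for item in deserialize(raw):
            yield item
    finally:
        raw.close()

stream = io.BytesIO(b"a\nb\nc\n")
items = list(load_from_stream(stream, lambda f: (line.strip().decode("utf-8") for line in f)))
print(items, stream.closed)  # -> ['a', 'b', 'c'] True
```

Which of the two shapes is in use does not by itself explain the UnicodeDecodeError, but it is the structural change between the rdd.py versions the commenter swapped.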
[jira] [Updated] (SPARK-19873) If the user changes the number of shuffle partitions between batches, Streaming aggregation will fail.
[ https://issues.apache.org/jira/browse/SPARK-19873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kunal Khamar updated SPARK-19873: - Summary: If the user changes the number of shuffle partitions between batches, Streaming aggregation will fail. (was: If the user changes the shuffle partition number between batches, Streaming aggregation will fail.) > If the user changes the number of shuffle partitions between batches, > Streaming aggregation will fail. > -- > > Key: SPARK-19873 > URL: https://issues.apache.org/jira/browse/SPARK-19873 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Kunal Khamar > > If the user changes the shuffle partition number between batches, Streaming > aggregation will fail. > Here are some possible cases: > - Change "spark.sql.shuffle.partitions" > - Use "repartition" and change the partition number in codes > - RangePartitioner doesn't generate deterministic partitions. Right now it's > safe as we disallow sort before aggregation. Not sure if we will add some > operators using RangePartitioner in future. > Fix: > Record # shuffle partitions in offset log and enforce in next batch -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19873) If the user changes the shuffle partition number between batches, Streaming aggregation will fail.
[ https://issues.apache.org/jira/browse/SPARK-19873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kunal Khamar updated SPARK-19873:
-
Description:
If the user changes the shuffle partition number between batches, Streaming aggregation will fail.
Here are some possible cases:
- Change "spark.sql.shuffle.partitions"
- Use "repartition" and change the partition number in codes
- RangePartitioner doesn't generate deterministic partitions. Right now it's safe as we disallow sort before aggregation. Not sure if we will add some operators using RangePartitioner in future.
Fix:
Record # shuffle partition in offset log and enforce in next batch

was:
It the user changes the shuffle partition number between batches, Streaming aggregation will fail.
Here are some possible cases:
- Change "spark.sql.shuffle.partitions"
- Use "repartition" and change the partition number in codes
- RangePartitioner doesn't generate deterministic partitions. Right now it's safe as we disallow sort before aggregation. Not sure if we will add some operators using RangePartitioner in future.
Fix:
Record # shuffle partition in offset log and enforce in next batch

> If the user changes the shuffle partition number between batches, Streaming aggregation will fail.
> --
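The proposed fix ("record # shuffle partitions in offset log and enforce in next batch") can be sketched in miniature. Everything here — function names, the dict-as-log layout — is hypothetical and not Spark's actual offset-log format; it only shows the recorded-value-wins rule:

```python
import json

def write_offset_log(log, batch_id, num_shuffle_partitions):
    # Persist the shuffle partition count alongside the batch's offsets.
    log[batch_id] = json.dumps({"numShufflePartitions": num_shuffle_partitions})

def resolve_partitions(log, prev_batch_id, session_conf_value):
    # On the next batch, reuse the recorded value rather than the (possibly
    # changed) session conf: aggregation state files are keyed by partition
    # id, so a different partitioning could not locate existing state.
    if prev_batch_id in log:
        return json.loads(log[prev_batch_id])["numShufflePartitions"]
    return session_conf_value

log = {}
write_offset_log(log, batch_id=0, num_shuffle_partitions=200)
# The user later sets spark.sql.shuffle.partitions=50; the recorded value wins.
print(resolve_partitions(log, 0, 50))  # -> 200
```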
[jira] [Updated] (SPARK-19873) If the user changes the shuffle partition number between batches, Streaming aggregation will fail.
[ https://issues.apache.org/jira/browse/SPARK-19873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kunal Khamar updated SPARK-19873:
-
Description:
If the user changes the shuffle partition number between batches, Streaming aggregation will fail.
Here are some possible cases:
- Change "spark.sql.shuffle.partitions"
- Use "repartition" and change the partition number in codes
- RangePartitioner doesn't generate deterministic partitions. Right now it's safe as we disallow sort before aggregation. Not sure if we will add some operators using RangePartitioner in future.
Fix:
Record # shuffle partitions in offset log and enforce in next batch

was:
If the user changes the shuffle partition number between batches, Streaming aggregation will fail.
Here are some possible cases:
- Change "spark.sql.shuffle.partitions"
- Use "repartition" and change the partition number in codes
- RangePartitioner doesn't generate deterministic partitions. Right now it's safe as we disallow sort before aggregation. Not sure if we will add some operators using RangePartitioner in future.
Fix:
Record # shuffle partition in offset log and enforce in next batch

> If the user changes the shuffle partition number between batches, Streaming aggregation will fail.
> --
[jira] [Updated] (SPARK-19873) If the user changes the shuffle partition number between batches, Streaming aggregation will fail.
[ https://issues.apache.org/jira/browse/SPARK-19873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kunal Khamar updated SPARK-19873:
-
Description:
It the user changes the shuffle partition number between batches, Streaming aggregation will fail.
Here are some possible cases:
- Change "spark.sql.shuffle.partitions"
- Use "repartition" and change the partition number in codes
- RangePartitioner doesn't generate deterministic partitions. Right now it's safe as we disallow sort before aggregation. Not sure if we will add some operators using RangePartitioner in future.
Fix:
Record # shuffle partition in offset log and enforce in next batch

> If the user changes the shuffle partition number between batches, Streaming aggregation will fail.
> --
[jira] [Created] (SPARK-19873) If the user changes the shuffle partition number between batches, Streaming aggregation will fail.
Kunal Khamar created SPARK-19873: Summary: If the user changes the shuffle partition number between batches, Streaming aggregation will fail. Key: SPARK-19873 URL: https://issues.apache.org/jira/browse/SPARK-19873 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Kunal Khamar -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition
[ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901974#comment-15901974 ] Brian Bruggeman commented on SPARK-19872: - This is a regression from spark 2.0.x. > UnicodeDecodeError in Pyspark on sc.textFile read with repartition > -- > > Key: SPARK-19872 > URL: https://issues.apache.org/jira/browse/SPARK-19872 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 > Environment: Mac and EC2 >Reporter: Brian Bruggeman > > I'm receiving the following traceback: > {code} > >>> sc.textFile('test.txt').repartition(10).collect() > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 810, in collect > return list(_load_from_socket(port, self._jrdd_deserializer)) > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 140, in _load_from_socket > for item in serializer.load_stream(rf): > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 539, in load_stream > yield self.loads(stream) > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 534, in loads > return s.decode("utf-8") if self.use_unicode else s > File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", > line 16, in decode > return codecs.utf_8_decode(input, errors, True) > UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: > invalid start byte > {code} > I created a textfile (text.txt) with standard linux newlines: > {code} > a > b > d > e > f > g > h > i > j > k > l > {code} > I think ran pyspark: > {code} > $ pyspark > Python 2.7.13 (default, Dec 18 2016, 07:03:39) > [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin > Type "help", "copyright", "credits" or "license" for more information. 
> Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.1.0 > /_/ > Using Python version 2.7.13 (default, Dec 18 2016 07:03:39) > SparkSession available as 'spark'. > >>> sc.textFile('test.txt').collect() > [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l'] > >>> sc.textFile('test.txt', use_unicode=False).collect() > ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l'] > >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect() > ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', > '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.'] > >>> sc.textFile('test.txt').repartition(10).collect() > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 810, in collect > return list(_load_from_socket(port, self._jrdd_deserializer)) > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 140, in _load_from_socket > for item in serializer.load_stream(rf): > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 539, in load_stream > yield self.loads(stream) > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 534, in loads > return s.decode("utf-8") if self.use_unicode else s > File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", > line 16, in decode > return codecs.utf_8_decode(input, errors, True) > 
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: > invalid start byte > {code} > This really looks like a bug in the `serializers.py` code. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
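The output quoted above gives a strong hint: with `use_unicode=False`, repartition returns strings such as `'\x80\x02]q\x01(...e.'` instead of individual lines. A quick sanity check outside Spark (a sketch over the exact bytes quoted in the report, not over PySpark internals) shows those bytes are a complete pickle protocol 2 payload that unpickles back into the file's lines, which suggests repartition switched the RDD to the batched pickle serializer while collect still tries to decode raw UTF-8 lines:

```python
import pickle

# Bytes exactly as returned in the report for
# sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
payload = b'\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.'

# The payload is a self-contained pickle protocol 2 document:
# unpickling recovers the original lines of the first partition batch.
print(pickle.loads(payload))  # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
```

If that reading is right, the fix belongs in how the serializer is tracked across `repartition`, not in the decode step itself.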
[jira] [Updated] (SPARK-19481) Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19481: - Fix Version/s: 2.0.3 > Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in > ClosureCleaner > - > > Key: SPARK-19481 > URL: https://issues.apache.org/jira/browse/SPARK-19481 > Project: Spark > Issue Type: Test > Components: Spark Shell >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > org.apache.spark.repl.cancelOnInterrupt leaks a SparkContext and makes the > tests unstable. See: > http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.repl.ReplSuite_name=should+clone+and+clean+line+object+in+ClosureCleaner
[jira] [Updated] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition
[ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Bruggeman updated SPARK-19872: Description: I'm receiving the following traceback: {code} >>> sc.textFile('test.txt').repartition(10).collect() Traceback (most recent call last): File "", line 1, in File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 810, in collect return list(_load_from_socket(port, self._jrdd_deserializer)) File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 140, in _load_from_socket for item in serializer.load_stream(rf): File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 539, in load_stream yield self.loads(stream) File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 534, in loads return s.decode("utf-8") if self.use_unicode else s File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 16, in decode return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte {code} I created a text file (test.txt) with standard linux newlines: {code} a b c d e f g h i j k l {code} I then ran pyspark: {code} $ pyspark Python 2.7.13 (default, Dec 18 2016, 07:03:39) [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin Type "help", "copyright", "credits" or "license" for more information. Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 2.1.0 /_/ Using Python version 2.7.13 (default, Dec 18 2016 07:03:39) SparkSession available as 'spark'. >>> sc.textFile('test.txt').collect() [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l'] >>> sc.textFile('test.txt', use_unicode=False).collect() ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l'] >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect() ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.'] >>> sc.textFile('test.txt').repartition(10).collect() Traceback (most recent call last): File "", line 1, in File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 810, in collect return list(_load_from_socket(port, self._jrdd_deserializer)) File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 140, in _load_from_socket for item in serializer.load_stream(rf): File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 539, in load_stream yield self.loads(stream) File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 534, in loads return s.decode("utf-8") if self.use_unicode else s File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 16, in decode return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte {code} This really looks like a bug in the `serializers.py` code. 
[jira] [Updated] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition
[ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Bruggeman updated SPARK-19872: Description: I'm receiving the following traceback: {{ >>> sc.textFile('test.txt').repartition(10).collect() Traceback (most recent call last): File "", line 1, in File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 810, in collect return list(_load_from_socket(port, self._jrdd_deserializer)) File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 140, in _load_from_socket for item in serializer.load_stream(rf): File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 539, in load_stream yield self.loads(stream) File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 534, in loads return s.decode("utf-8") if self.use_unicode else s File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 16, in decode return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte }} I created a text file (test.txt) with standard linux newlines: {{ a b c d e f g h i j k l }} I then ran pyspark: {{ $ pyspark Python 2.7.13 (default, Dec 18 2016, 07:03:39) [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin Type "help", "copyright", "credits" or "license" for more information. Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 2.1.0 /_/ Using Python version 2.7.13 (default, Dec 18 2016 07:03:39) SparkSession available as 'spark'. >>> sc.textFile('test.txt').collect() [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l'] >>> sc.textFile('test.txt', use_unicode=False).collect() ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l'] >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect() ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.'] >>> sc.textFile('test.txt').repartition(10).collect() Traceback (most recent call last): File "", line 1, in File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 810, in collect return list(_load_from_socket(port, self._jrdd_deserializer)) File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 140, in _load_from_socket for item in serializer.load_stream(rf): File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 539, in load_stream yield self.loads(stream) File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 534, in loads return s.decode("utf-8") if self.use_unicode else s File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 16, in decode return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte }} This really looks like a bug in the `serializers.py` code. 
[jira] [Created] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition
Brian Bruggeman created SPARK-19872: --- Summary: UnicodeDecodeError in Pyspark on sc.textFile read with repartition Key: SPARK-19872 URL: https://issues.apache.org/jira/browse/SPARK-19872 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.1.0 Environment: Mac and EC2 Reporter: Brian Bruggeman I'm receiving the following traceback: ``` >>> sc.textFile('test.txt').repartition(10).collect() Traceback (most recent call last): File "", line 1, in File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 810, in collect return list(_load_from_socket(port, self._jrdd_deserializer)) File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 140, in _load_from_socket for item in serializer.load_stream(rf): File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 539, in load_stream yield self.loads(stream) File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 534, in loads return s.decode("utf-8") if self.use_unicode else s File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 16, in decode return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte ``` I created a text file (test.txt) with standard linux newlines: ``` a b c d e f g h i j k l ``` I then ran pyspark: ``` $ pyspark Python 2.7.13 (default, Dec 18 2016, 07:03:39) [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin Type "help", "copyright", "credits" or "license" for more information. Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 2.1.0 /_/ Using Python version 2.7.13 (default, Dec 18 2016 07:03:39) SparkSession available as 'spark'. >>> sc.textFile('test.txt').collect() [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l'] >>> sc.textFile('test.txt', use_unicode=False).collect() ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l'] >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect() ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.'] >>> sc.textFile('test.txt').repartition(10).collect() Traceback (most recent call last): File "", line 1, in File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 810, in collect return list(_load_from_socket(port, self._jrdd_deserializer)) File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 140, in _load_from_socket for item in serializer.load_stream(rf): File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 539, in load_stream yield self.loads(stream) File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 534, in loads return s.decode("utf-8") if self.use_unicode else s File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 16, in decode return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte ``` This really looks like a bug in the `serializers.py` code.
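The exact error message is also consistent with a serializer mismatch: every pickle protocol 2 stream begins with the PROTO opcode byte 0x80, which can never start a valid UTF-8 sequence, so decoding a pickled batch as UTF-8 fails at position 0 with "invalid start byte". A minimal sketch (plain Python, no Spark needed):

```python
import pickle

# Any pickle protocol 2 stream begins with the PROTO opcode 0x80,
# which is never a legal first byte of a UTF-8 sequence.
payload = pickle.dumps(['a', 'b'], protocol=2)
assert payload[0] == 0x80

try:
    payload.decode('utf-8')
except UnicodeDecodeError as err:
    # Mirrors the reported error: invalid start byte at position 0.
    print(err.reason, err.start)  # invalid start byte 0
```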
[jira] [Assigned] (SPARK-19727) Spark SQL round function modifies original column
[ https://issues.apache.org/jira/browse/SPARK-19727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19727: --- Assignee: Wojciech Szymanski > Spark SQL round function modifies original column > - > > Key: SPARK-19727 > URL: https://issues.apache.org/jira/browse/SPARK-19727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Sławomir Bogutyn >Assignee: Wojciech Szymanski >Priority: Minor > Fix For: 2.2.0 > > > {code:java} > import org.apache.spark.sql.functions > case class MyRow(value : BigDecimal) > val values = List(MyRow(BigDecimal.valueOf(1.23456789))) > val dataFrame = spark.createDataFrame(values) > dataFrame.show() > dataFrame.withColumn("value_rounded", > functions.round(dataFrame.col("value"))).show() > {code} > This produces output: > {noformat} > ++ > | value| > ++ > |1.2345678900| > ++ > ++-+ > | value|value_rounded| > ++-+ > |1.00|1| > ++-+ > {noformat} > The same problem occurs when I use the round function to filter a DataFrame.
[jira] [Resolved] (SPARK-19727) Spark SQL round function modifies original column
[ https://issues.apache.org/jira/browse/SPARK-19727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19727. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17075 [https://github.com/apache/spark/pull/17075] > Spark SQL round function modifies original column > - > > Key: SPARK-19727 > URL: https://issues.apache.org/jira/browse/SPARK-19727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Sławomir Bogutyn >Priority: Minor > Fix For: 2.2.0 > > > {code:java} > import org.apache.spark.sql.functions > case class MyRow(value : BigDecimal) > val values = List(MyRow(BigDecimal.valueOf(1.23456789))) > val dataFrame = spark.createDataFrame(values) > dataFrame.show() > dataFrame.withColumn("value_rounded", > functions.round(dataFrame.col("value"))).show() > {code} > This produces output: > {noformat} > ++ > | value| > ++ > |1.2345678900| > ++ > ++-+ > | value|value_rounded| > ++-+ > |1.00|1| > ++-+ > {noformat} > The same problem occurs when I use the round function to filter a DataFrame.
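For reference, the semantics the fix restores can be sketched in plain Python with Decimal (an analogy only, with hypothetical values, not Spark code): rounding derives a new column and leaves the source column untouched.

```python
from decimal import Decimal, ROUND_HALF_UP

# Source "column": deriving a rounded column from it must not mutate it.
values = [Decimal("1.2345678900")]

# Derive the rounded column; ROUND_HALF_UP mirrors SQL-style round().
rounded = [v.quantize(Decimal("1"), rounding=ROUND_HALF_UP) for v in values]

print(values)   # [Decimal('1.2345678900')] -- original column intact
print(rounded)  # [Decimal('1')]
```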
[jira] [Updated] (SPARK-18355) Spark SQL fails to read data from a ORC hive table that has a new column added to it
[ https://issues.apache.org/jira/browse/SPARK-18355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-18355: -- Affects Version/s: 2.1.0 > Spark SQL fails to read data from a ORC hive table that has a new column > added to it > > > Key: SPARK-18355 > URL: https://issues.apache.org/jira/browse/SPARK-18355 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.1.0 > Environment: Centos6 >Reporter: Sandeep Nemuri > > *PROBLEM*: > Spark SQL fails to read data from a ORC hive table that has a new column > added to it. > Below is the exception: > {code} > scala> sqlContext.sql("select click_id,search_id from testorc").show > 16/11/03 22:17:53 INFO ParseDriver: Parsing command: select > click_id,search_id from testorc > 16/11/03 22:17:54 INFO ParseDriver: Parse Completed > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at > org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:39) > at > org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:38) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:38) > at > org.apache.spark.sql.execution.datasources.LogicalRelation.copy(LogicalRelation.scala:31) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToOrcRelation(HiveMetastoreCatalog.scala:588) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:647) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:643) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > {code} > *STEPS TO SIMULATE THIS ISSUE*: > 1) Create table in hive. > {code} > CREATE TABLE `testorc`( > `click_id` string, > `search_id` string, > `uid` bigint) > PARTITIONED BY ( > `ts` string, > `hour` string) > STORED AS ORC; > {code} > 2) Load data into table : > {code} > INSERT INTO TABLE testorc PARTITION (ts = '98765',hour = '01' ) VALUES > (12,2,12345); > {code} > 3) Select through spark shell (This works) > {code} > sqlContext.sql("select click_id,search_id from testorc").show > {code} > 4) Now add column to hive table > {code} > ALTER TABLE testorc ADD COLUMNS (dummy string); > {code} > 5) Now again select from spark shell > {code} > scala> sqlContext.sql("select click_id,search_id from testorc").show > 16/11/03 22:17:53 INFO ParseDriver: Parsing command: select > click_id,search_id from testorc > 16/11/03 22:17:54 INFO ParseDriver: Parse Completed > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at > org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:39) > at > org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:38) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:38) > at > 
org.apache.spark.sql.execution.datasources.LogicalRelation.copy(LogicalRelation.scala:31) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToOrcRelation(HiveMetastoreCatalog.scala:588) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:647) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:643) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335) > at >
[jira] [Commented] (SPARK-18355) Spark SQL fails to read data from a ORC hive table that has a new column added to it
[ https://issues.apache.org/jira/browse/SPARK-18355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901914#comment-15901914 ] Dongjoon Hyun commented on SPARK-18355: --- I confirmed that this happens only with `spark.sql.hive.convertMetastoreOrc=true`. {code} scala> spark.version res0: String = 2.2.0-SNAPSHOT scala> sql("select click_id, search_id from testorc").show ++-+ |click_id|search_id| ++-+ | 12|2| ++-+ scala> sql("set spark.sql.hive.convertMetastoreOrc=true").show(false) +--+-+ |key |value| +--+-+ |spark.sql.hive.convertMetastoreOrc|true | +--+-+ scala> sql("select click_id, search_id from testorc").show 17/03/08 12:15:43 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) java.lang.IllegalArgumentException: Field "click_id" does not exist. {code} > Spark SQL fails to read data from a ORC hive table that has a new column > added to it > > > Key: SPARK-18355 > URL: https://issues.apache.org/jira/browse/SPARK-18355 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 > Environment: Centos6 >Reporter: Sandeep Nemuri > > *PROBLEM*: > Spark SQL fails to read data from a ORC hive table that has a new column > added to it. 
> Below is the exception: > {code} > scala> sqlContext.sql("select click_id,search_id from testorc").show > 16/11/03 22:17:53 INFO ParseDriver: Parsing command: select > click_id,search_id from testorc > 16/11/03 22:17:54 INFO ParseDriver: Parse Completed > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at > org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:39) > at > org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:38) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:38) > at > org.apache.spark.sql.execution.datasources.LogicalRelation.copy(LogicalRelation.scala:31) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToOrcRelation(HiveMetastoreCatalog.scala:588) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:647) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:643) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > {code} > *STEPS TO SIMULATE THIS ISSUE*: > 1) Create table in hive. 
> {code} > CREATE TABLE `testorc`( > `click_id` string, > `search_id` string, > `uid` bigint) > PARTITIONED BY ( > `ts` string, > `hour` string) > STORED AS ORC; > {code} > 2) Load data into table : > {code} > INSERT INTO TABLE testorc PARTITION (ts = '98765',hour = '01' ) VALUES > (12,2,12345); > {code} > 3) Select through spark shell (This works) > {code} > sqlContext.sql("select click_id,search_id from testorc").show > {code} > 4) Now add column to hive table > {code} > ALTER TABLE testorc ADD COLUMNS (dummy string); > {code} > 5) Now again select from spark shell > {code} > scala> sqlContext.sql("select click_id,search_id from testorc").show > 16/11/03 22:17:53 INFO ParseDriver: Parsing command: select > click_id,search_id from testorc > 16/11/03 22:17:54 INFO ParseDriver: Parse Completed > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at > org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:39) > at > org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:38) > at scala.Option.map(Option.scala:145) > at >
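Until the conversion path handles added columns, a possible workaround suggested by the experiment in the comment above (an inference, not a verified fix) is to keep the ORC metastore conversion disabled so reads stay on Hive's ORC path:

```
# spark-defaults.conf (or --conf on spark-shell); workaround only --
# it sidesteps the failing conversion path rather than fixing the bug
spark.sql.hive.convertMetastoreOrc false
```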
[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe
[ https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901769#comment-15901769 ] Owen O'Malley commented on SPARK-15474: --- OK, Hive's use is fine because it gets the schema from the metastore, and the file schema only matters for schema evolution, which isn't relevant if there are no rows. In fact, it gets worse in newer versions of Hive, where the OrcOutputFormat will write 0-byte files and OrcInputFormat will ignore 0-byte files for reading. (The reason behind needing the files at all is an interesting bit of Hive history, but not relevant for this.) The real fix is that Spark needs to use the OrcFile.createWriter(...) API to write the files rather than Hive's OrcOutputFormat. The OrcFile API lets the caller set the schema directly. > ORC data source fails to write and read back empty dataframe > - > > Key: SPARK-15474 > URL: https://issues.apache.org/jira/browse/SPARK-15474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > Currently the ORC data source fails to write and read empty data. > The code below: > {code} > val emptyDf = spark.range(10).limit(0) > emptyDf.write > .format("orc") > .save(path.getCanonicalPath) > val copyEmptyDf = spark.read > .format("orc") > .load(path.getCanonicalPath) > copyEmptyDf.show() > {code} > throws an exception below: > {code} > Unable to infer schema for ORC at > /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. > It must be specified manually; > org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at > /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. 
> It must be specified manually; > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140) > at > org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892) > at > org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884) > at > org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114) > {code} > Note that this is a different case from the data below > {code} > val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema) > {code} > In this case, no writer is initialised or created (no calls of > {{WriterContainer.writeRows()}}). > For Parquet and JSON, it works but ORC does not.
[jira] [Resolved] (SPARK-19864) add makeQualifiedPath in SQLTestUtils to optimize some code
[ https://issues.apache.org/jira/browse/SPARK-19864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19864. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17204 [https://github.com/apache/spark/pull/17204] > add makeQualifiedPath in SQLTestUtils to optimize some code > --- > > Key: SPARK-19864 > URL: https://issues.apache.org/jira/browse/SPARK-19864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Minor > Fix For: 2.2.0 > > > Currently there are lots of places that make a path qualified; it would be > better to provide a function to do this, so the code will be simpler.
[jira] [Assigned] (SPARK-19864) add makeQualifiedPath in SQLTestUtils to optimize some code
[ https://issues.apache.org/jira/browse/SPARK-19864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19864: --- Assignee: Song Jun > add makeQualifiedPath in SQLTestUtils to optimize some code > --- > > Key: SPARK-19864 > URL: https://issues.apache.org/jira/browse/SPARK-19864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Assignee: Song Jun >Priority: Minor > Fix For: 2.2.0 > > > Currently there are lots of places that make a path qualified; it would be > better to provide a function to do this, so the code will be simpler.
[jira] [Resolved] (SPARK-18209) More robust view canonicalization without full SQL expansion
[ https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18209. - Resolution: Fixed > More robust view canonicalization without full SQL expansion > > > Key: SPARK-18209 > URL: https://issues.apache.org/jira/browse/SPARK-18209 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > Spark SQL currently stores views by analyzing the provided SQL and then > generating fully expanded SQL out of the analyzed logical plan. This is > actually a very error-prone way of doing it, because: > 1. It is non-trivial to guarantee that the generated SQL is correct without > being extremely verbose, given the current set of operators. > 2. We need extensive testing for all combinations of operators. > 3. Whenever we introduce a new logical plan operator, we need to be super > careful because it might break SQL generation. This is the main reason the > broadcast join hint took forever to be merged: it is very > difficult to guarantee correctness. > Given that the two primary reasons to do view canonicalization are to provide > the database context and star expansion, I think we can achieve this > through a simpler approach: take the user-given SQL, analyze it, and > just wrap the original SQL in an outer SELECT clause, storing the > database as a hint. > For example, given the following view creation SQL: > {code} > USE DATABASE my_db; > CREATE TABLE my_table (id int, name string); > CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10; > {code} > We store the following SQL instead: > {code} > SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE > id > 10); > {code} > At parsing time, we expand the view using the provided database > context. > (We don't need to follow exactly the same hint, as I'm merely illustrating > the high level approach here.)
> Note that there is a chance that the underlying base table(s)' schema changes > and the stored schema of the view then differs from the actual SQL schema. In > that case, I think we should throw an exception at runtime to warn users. > This exception can be controlled by a flag. > Update 1: based on the discussion below, we don't even need to put the view > definition in a subquery. We can just add it via a logical plan at the end. > Update 2: we should make sure permanent views do not depend on temporary > objects (views, tables, or functions).
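The wrapping scheme the issue proposes can be sketched as a simple string transformation. `canonicalize_view` and its signature are hypothetical, intended only to mirror the `my_view` example quoted above:

```python
# Sketch of the proposed storage format: wrap the user's original view
# SQL in an outer SELECT listing the analyzed output columns, and record
# the current database as a comment-style hint. Purely illustrative of
# the JIRA's example, not Spark's implementation.
def canonicalize_view(view_sql: str, columns, current_db: str) -> str:
    col_list = ", ".join(columns)
    return (f"SELECT /*+ current_db: `{current_db}` */ {col_list} "
            f"FROM ({view_sql})")

stored = canonicalize_view(
    "SELECT * FROM my_table WHERE id > 10", ["id", "name"], "my_db")
print(stored)
# SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE id > 10)
```

Note how the outer column list pins down the star expansion at creation time, while the hint preserves the database context; both concerns are handled without regenerating the full SQL from the logical plan.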
[jira] [Commented] (SPARK-19871) Improve error message in verify_type to indicate which field the error is for
[ https://issues.apache.org/jira/browse/SPARK-19871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15901668#comment-15901668 ] Len Frodgers commented on SPARK-19871: -- https://github.com/apache/spark/pull/17213 > Improve error message in verify_type to indicate which field the error is for > - > > Key: SPARK-19871 > URL: https://issues.apache.org/jira/browse/SPARK-19871 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Len Frodgers > > The error message for _verify_type is too vague. It should specify for which > field the type verification failed. > e.g. error: "This field is not nullable, but got None" – but what is the > field?! > In a dataframe with many non-nullable fields, it is a nightmare trying to > hunt down which field is null, since one has to resort to basic (and very > slow) trial and error. > I have happily created a PR to fix this.
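The improvement requested above, threading a field path through recursive type verification so the error names the offending field, can be sketched in plain Python. This is a simplified stand-in, not PySpark's actual `_verify_type`; the schema encoding and error text are illustrative.

```python
# Sketch: recursive verification that carries a dotted field path, so
# "This field is not nullable, but got None" becomes actionable.
# Schemas here are dicts mapping field -> (subschema-or-type, nullable).
def verify(value, schema, path="root"):
    if isinstance(schema, dict):
        for field, (sub, nullable) in schema.items():
            child = value.get(field) if value is not None else None
            if child is None:
                if not nullable:
                    raise ValueError(
                        f"field {path}.{field}: This field is not nullable, "
                        "but got None")
            else:
                verify(child, sub, f"{path}.{field}")
    elif value is not None and not isinstance(value, schema):
        raise ValueError(f"field {path}: expected {schema.__name__}, "
                         f"got {type(value).__name__}")

schema = {"name": (str, False), "address": ({"city": (str, False)}, True)}
try:
    verify({"name": "a", "address": {"city": None}}, schema)
except ValueError as e:
    print(e)  # field root.address.city: This field is not nullable, but got None
```

With the path included, a null buried inside a nested struct is reported directly instead of forcing the trial-and-error hunt the reporter describes.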
[jira] [Commented] (SPARK-13740) add null check for _verify_type in types.py
[ https://issues.apache.org/jira/browse/SPARK-13740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15901667#comment-15901667 ] Len Frodgers commented on SPARK-13740: -- https://github.com/apache/spark/pull/17213 > add null check for _verify_type in types.py > --- > > Key: SPARK-13740 > URL: https://issues.apache.org/jira/browse/SPARK-13740 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > >
[jira] [Assigned] (SPARK-19871) Improve error message in verify_type to indicate which field the error is for
[ https://issues.apache.org/jira/browse/SPARK-19871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19871: Assignee: Apache Spark > Improve error message in verify_type to indicate which field the error is for > - > > Key: SPARK-19871 > URL: https://issues.apache.org/jira/browse/SPARK-19871 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Len Frodgers >Assignee: Apache Spark > > The error message for _verify_type is too vague. It should specify for which > field the type verification failed. > e.g. error: "This field is not nullable, but got None" – but what is the > field?! > In a dataframe with many non-nullable fields, it is a nightmare trying to > hunt down which field is null, since one has to resort to basic (and very > slow) trial and error. > I have happily created a PR to fix this.