[jira] [Commented] (SPARK-23588) Add interpreted execution for CatalystToExternalMap expression
[ https://issues.apache.org/jira/browse/SPARK-23588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394827#comment-16394827 ] Takeshi Yamamuro commented on SPARK-23588: -- I'll make a pr later: https://github.com/apache/spark/compare/master...maropu:SPARK-23588 > Add interpreted execution for CatalystToExternalMap expression > -- > > Key: SPARK-23588 > URL: https://issues.apache.org/jira/browse/SPARK-23588 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Takeshi Yamamuro >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
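To make the sub-task concrete: interpreted execution here means an eval() path that builds the external Scala map without generated Java code. A minimal standalone sketch of that core loop, assuming the map's keys and values have already been converted to external objects (the helper name and signature are illustrative, not Spark's internal API):
{code:java}
import scala.collection.mutable

// Build the external Scala Map from parallel key/value arrays; this is the
// heart of what an interpreted CatalystToExternalMap evaluation must do.
def toExternalMap(keys: Array[Any], values: Array[Any]): Map[Any, Any] = {
  require(keys.length == values.length, "keys and values must align")
  val builder = mutable.HashMap.empty[Any, Any]
  var i = 0
  while (i < keys.length) { // while loop mirrors the codegen'd tight loop
    builder.put(keys(i), values(i))
    i += 1
  }
  builder.toMap
}
{code}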
[jira] [Commented] (SPARK-18134) SQL: MapType in Group BY and Joins not working
[ https://issues.apache.org/jira/browse/SPARK-18134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394787#comment-16394787 ] Takeshi Yamamuro commented on SPARK-18134: -- Sorry for interrupting, but the currently proposed pr (https://github.com/apache/spark/pull/19330#issuecomment-356787180) touches the stable interface in MapType, so it seems this ticket is weakly blocked until the next major release, I think. > SQL: MapType in Group BY and Joins not working > -- > > Key: SPARK-18134 > URL: https://issues.apache.org/jira/browse/SPARK-18134 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1, > 2.1.0 >Reporter: Christian Zorneck >Priority: Major > > Since version 1.5 and issue SPARK-9415, MapTypes can no longer be used in > GROUP BY and join clauses. This makes it incompatible with HiveQL: a Hive > feature was removed from Spark, making Spark incompatible with various > HiveQL statements. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
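Until maps are groupable again, a common user-side workaround is to group on a canonical rendering of the map instead of the map column itself. A hedged sketch (the canon UDF is illustrative, not a Spark built-in, and assumes string keys and int values):
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(Map("a" -> 1), Map("a" -> 1), Map("b" -> 2)).toDF("m")

// Render each map as a sorted key=value string so equal maps compare equal.
val canon = udf((m: Map[String, Int]) =>
  m.toSeq.sortBy(_._1).map { case (k, v) => s"$k=$v" }.mkString(","))

df.groupBy(canon($"m").as("m_key")).count().show()
{code}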
[jira] [Commented] (SPARK-23627) Provide isEmpty() function in DataSet
[ https://issues.apache.org/jira/browse/SPARK-23627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394782#comment-16394782 ] Apache Spark commented on SPARK-23627: -- User 'goungoun' has created a pull request for this issue: https://github.com/apache/spark/pull/20800 > Provide isEmpty() function in DataSet > - > > Key: SPARK-23627 > URL: https://issues.apache.org/jira/browse/SPARK-23627 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Goun Na >Priority: Trivial > > Like rdd.isEmpty, adding isEmpty to DataSet would be useful. > Some code without isEmpty: > {code:java} > if (df.count() == 0) { do_something }{code} > Some people add limit 1 for a performance reason: > {code:java} > if (df.limit(1).rdd.isEmpty) { do_something } > if (df.rdd.take(1).isEmpty) { do_something }{code} > > If isEmpty is provided, the code will be perfectly clean: > {code:java} > if (df.isEmpty) { do_something }{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
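In the meantime, the helper is expressible in user space. A minimal sketch, assuming a Spark version without a built-in Dataset#isEmpty:
{code:java}
import org.apache.spark.sql.Dataset

// limit(1) lets Spark stop after finding a single row, unlike count(),
// which scans the full dataset.
def isEmpty[T](ds: Dataset[T]): Boolean = ds.limit(1).collect().isEmpty
{code}
A built-in version could push the single-row fetch further down into the physical plan, but the user-visible semantics would be the same.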
[jira] [Commented] (SPARK-23634) AttributeReferences may be too conservative wrt nullability after optimization
[ https://issues.apache.org/jira/browse/SPARK-23634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394777#comment-16394777 ] Takeshi Yamamuro commented on SPARK-23634: -- This ticket is probably a duplicate of https://issues.apache.org/jira/browse/SPARK-21351? In the previous discussion (https://github.com/apache/spark/pull/18576), adding a new rule for this purpose seemed somewhat intrusive, so we need to look for other approaches to solve this, I think. > AttributeReferences may be too conservative wrt nullability after optimization > -- > > Key: SPARK-23634 > URL: https://issues.apache.org/jira/browse/SPARK-23634 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Henry Robinson >Priority: Minor > > An {{AttributeReference}} effectively caches the nullability of its referent > when it is created. Some optimization rules can transform a nullable > attribute into a non-nullable one, but the references to it are not updated. > We could add a transformation rule that visits every {{AttributeReference}} > and fixes its nullability after optimization. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
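To make the proposed rule concrete, a hedged sketch (illustrative only, not an existing Spark rule): after optimization, re-derive each reference's nullability from the attribute the children currently produce.
{code:java}
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object TightenNullability extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    case node =>
      // Attributes actually produced by the children, keyed by expression id.
      val fromChildren =
        node.children.flatMap(_.output).map(a => a.exprId -> a).toMap
      node transformExpressions {
        case ref: AttributeReference
            if ref.nullable && fromChildren.get(ref.exprId).exists(!_.nullable) =>
          ref.withNullability(false) // the cached nullability is stale
      }
  }
}
{code}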
[jira] [Assigned] (SPARK-23635) Spark executor env variable is overwritten by same name AM env variable
[ https://issues.apache.org/jira/browse/SPARK-23635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23635: Assignee: Apache Spark > Spark executor env variable is overwritten by same name AM env variable > --- > > Key: SPARK-23635 > URL: https://issues.apache.org/jira/browse/SPARK-23635 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Apache Spark >Priority: Minor > > In the current Spark on YARN code, the AM will always copy and overwrite its env > variables to executors, so we cannot set different values for executors. > To reproduce the issue, a user could start spark-shell like: > {code:java} > ./bin/spark-shell --master yarn-client --conf > spark.executorEnv.SPARK_ABC=executor_val --conf > spark.yarn.appMasterEnv.SPARK_ABC=am_val > {code} > Then check the executor env variables by > {code:java} > sc.parallelize(1 to 1).flatMap { i => sys.env.toSeq }.collect.foreach(println) > {code} > You will always get {{am_val}} instead of {{executor_val}}. So we should not > let the AM overwrite specifically set executor env variables. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23635) Spark executor env variable is overwritten by same name AM env variable
[ https://issues.apache.org/jira/browse/SPARK-23635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394767#comment-16394767 ] Apache Spark commented on SPARK-23635: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/20799 > Spark executor env variable is overwritten by same name AM env variable > --- > > Key: SPARK-23635 > URL: https://issues.apache.org/jira/browse/SPARK-23635 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Minor > > In the current Spark on YARN code, the AM will always copy and overwrite its env > variables to executors, so we cannot set different values for executors. > To reproduce the issue, a user could start spark-shell like: > {code:java} > ./bin/spark-shell --master yarn-client --conf > spark.executorEnv.SPARK_ABC=executor_val --conf > spark.yarn.appMasterEnv.SPARK_ABC=am_val > {code} > Then check the executor env variables by > {code:java} > sc.parallelize(1 to 1).flatMap { i => sys.env.toSeq }.collect.foreach(println) > {code} > You will always get {{am_val}} instead of {{executor_val}}. So we should not > let the AM overwrite specifically set executor env variables. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23635) Spark executor env variable is overwritten by same name AM env variable
[ https://issues.apache.org/jira/browse/SPARK-23635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23635: Assignee: (was: Apache Spark) > Spark executor env variable is overwritten by same name AM env variable > --- > > Key: SPARK-23635 > URL: https://issues.apache.org/jira/browse/SPARK-23635 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Minor > > In the current Spark on YARN code, the AM will always copy and overwrite its env > variables to executors, so we cannot set different values for executors. > To reproduce the issue, a user could start spark-shell like: > {code:java} > ./bin/spark-shell --master yarn-client --conf > spark.executorEnv.SPARK_ABC=executor_val --conf > spark.yarn.appMasterEnv.SPARK_ABC=am_val > {code} > Then check the executor env variables by > {code:java} > sc.parallelize(1 to 1).flatMap { i => sys.env.toSeq }.collect.foreach(println) > {code} > You will always get {{am_val}} instead of {{executor_val}}. So we should not > let the AM overwrite specifically set executor env variables. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
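The precedence the reporter asks for is easy to state in isolation. A tiny sketch (the names are made up; this is not the actual YARN client code): the AM env should be the base layer, with spark.executorEnv.* values overlaid on top.
{code:java}
// Executor-specific settings come second, so they win on key collisions
// instead of being overwritten by the AM's copies.
def mergeExecutorEnv(
    amEnv: Map[String, String],
    executorEnv: Map[String, String]): Map[String, String] =
  amEnv ++ executorEnv

// mergeExecutorEnv(Map("SPARK_ABC" -> "am_val"), Map("SPARK_ABC" -> "executor_val"))
// returns Map("SPARK_ABC" -> "executor_val")
{code}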
[jira] [Assigned] (SPARK-23645) pandas_udf can not be called with keyword arguments
[ https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23645: Assignee: Apache Spark > pandas_udf can not be called with keyword arguments > --- > > Key: SPARK-23645 > URL: https://issues.apache.org/jira/browse/SPARK-23645 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 > Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, > OpenJDK 64-Bit Server VM, 1.8.0_141 >Reporter: Stu (Michael Stewart) >Assignee: Apache Spark >Priority: Minor > >
> pandas_udf (all python udfs(?)) does not accept keyword arguments, because the `pyspark/sql/udf.py` class `UserDefinedFunction` has a `__call__` method, and wrapper utility methods, that only accept args and not kwargs:
> @ line 168:
> {code:java}
> ...
> def __call__(self, *cols):
>     judf = self._judf
>     sc = SparkContext._active_spark_context
>     return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
>
> # This function is for improving the online help system in the interactive
> # interpreter. For example, the built-in help / pydoc.help. It wraps the UDF
> # with the docstring and argument annotation. (See: SPARK-19161)
> def _wrapped(self):
>     """
>     Wrap this udf with a function and attach docstring from func
>     """
>     # It is possible for a callable instance without __name__ attribute or/and
>     # __module__ attribute to be wrapped here. For example, functools.partial.
>     # In this case, we should avoid wrapping the attributes from the wrapped
>     # function to the wrapper function. So, we take out these attribute names
>     # from the default names to set and then manually assign it after being
>     # wrapped.
>     assignments = tuple(
>         a for a in functools.WRAPPER_ASSIGNMENTS
>         if a != '__name__' and a != '__module__')
>
>     @functools.wraps(self.func, assigned=assignments)
>     def wrapper(*args):
>         return self(*args)
> ...
> {code}
> as seen in:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
>
> def ok(a, b):
>     return a * b
>
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id', 'b')).show()
> # no problems
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id', b='b')).show()
> # fails with ~no stacktrace thanks to the wrapper helper
> ---------------------------------------------------------------------------
> TypeError                                 Traceback (most recent call last)
> in ()
> ----> 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id', b='b')).show()
> TypeError: wrapper() got an unexpected keyword argument 'a'
> {code}
> *discourse*: it isn't difficult to swap the kwargs back in, allowing the UDF to be called this way, but the cols tuple that gets passed in the call method:
> {code:java}
> _to_seq(sc, cols, _to_java_column)
> {code}
> has to be in the right order based on the function's defined argument inputs, or the function will return incorrect results. So, the challenge here is to:
> (a) make sure to reconstruct the proper order of the full args/kwargs
> --> args first, and then kwargs (not in the order passed but in the order requested by the fn)
> (b) handle python2 and python3 `inspect` module inconsistencies
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23645) pandas_udf can not be called with keyword arguments
[ https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23645: Assignee: (was: Apache Spark) > pandas_udf can not be called with keyword arguments > --- > > Key: SPARK-23645 > URL: https://issues.apache.org/jira/browse/SPARK-23645 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 > Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, > OpenJDK 64-Bit Server VM, 1.8.0_141 >Reporter: Stu (Michael Stewart) >Priority: Minor > >
> pandas_udf (all python udfs(?)) does not accept keyword arguments, because the `pyspark/sql/udf.py` class `UserDefinedFunction` has a `__call__` method, and wrapper utility methods, that only accept args and not kwargs:
> @ line 168:
> {code:java}
> ...
> def __call__(self, *cols):
>     judf = self._judf
>     sc = SparkContext._active_spark_context
>     return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
>
> # This function is for improving the online help system in the interactive
> # interpreter. For example, the built-in help / pydoc.help. It wraps the UDF
> # with the docstring and argument annotation. (See: SPARK-19161)
> def _wrapped(self):
>     """
>     Wrap this udf with a function and attach docstring from func
>     """
>     # It is possible for a callable instance without __name__ attribute or/and
>     # __module__ attribute to be wrapped here. For example, functools.partial.
>     # In this case, we should avoid wrapping the attributes from the wrapped
>     # function to the wrapper function. So, we take out these attribute names
>     # from the default names to set and then manually assign it after being
>     # wrapped.
>     assignments = tuple(
>         a for a in functools.WRAPPER_ASSIGNMENTS
>         if a != '__name__' and a != '__module__')
>
>     @functools.wraps(self.func, assigned=assignments)
>     def wrapper(*args):
>         return self(*args)
> ...
> {code}
> as seen in:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
>
> def ok(a, b):
>     return a * b
>
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id', 'b')).show()
> # no problems
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id', b='b')).show()
> # fails with ~no stacktrace thanks to the wrapper helper
> ---------------------------------------------------------------------------
> TypeError                                 Traceback (most recent call last)
> in ()
> ----> 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id', b='b')).show()
> TypeError: wrapper() got an unexpected keyword argument 'a'
> {code}
> *discourse*: it isn't difficult to swap the kwargs back in, allowing the UDF to be called this way, but the cols tuple that gets passed in the call method:
> {code:java}
> _to_seq(sc, cols, _to_java_column)
> {code}
> has to be in the right order based on the function's defined argument inputs, or the function will return incorrect results. So, the challenge here is to:
> (a) make sure to reconstruct the proper order of the full args/kwargs
> --> args first, and then kwargs (not in the order passed but in the order requested by the fn)
> (b) handle python2 and python3 `inspect` module inconsistencies
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23645) pandas_udf can not be called with keyword arguments
[ https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394639#comment-16394639 ] Apache Spark commented on SPARK-23645: -- User 'mstewart141' has created a pull request for this issue: https://github.com/apache/spark/pull/20798 > pandas_udf can not be called with keyword arguments > --- > > Key: SPARK-23645 > URL: https://issues.apache.org/jira/browse/SPARK-23645 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 > Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, > OpenJDK 64-Bit Server VM, 1.8.0_141 >Reporter: Stu (Michael Stewart) >Priority: Minor > >
> pandas_udf (all python udfs(?)) does not accept keyword arguments, because the `pyspark/sql/udf.py` class `UserDefinedFunction` has a `__call__` method, and wrapper utility methods, that only accept args and not kwargs:
> @ line 168:
> {code:java}
> ...
> def __call__(self, *cols):
>     judf = self._judf
>     sc = SparkContext._active_spark_context
>     return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
>
> # This function is for improving the online help system in the interactive
> # interpreter. For example, the built-in help / pydoc.help. It wraps the UDF
> # with the docstring and argument annotation. (See: SPARK-19161)
> def _wrapped(self):
>     """
>     Wrap this udf with a function and attach docstring from func
>     """
>     # It is possible for a callable instance without __name__ attribute or/and
>     # __module__ attribute to be wrapped here. For example, functools.partial.
>     # In this case, we should avoid wrapping the attributes from the wrapped
>     # function to the wrapper function. So, we take out these attribute names
>     # from the default names to set and then manually assign it after being
>     # wrapped.
>     assignments = tuple(
>         a for a in functools.WRAPPER_ASSIGNMENTS
>         if a != '__name__' and a != '__module__')
>
>     @functools.wraps(self.func, assigned=assignments)
>     def wrapper(*args):
>         return self(*args)
> ...
> {code}
> as seen in:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
>
> def ok(a, b):
>     return a * b
>
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id', 'b')).show()
> # no problems
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id', b='b')).show()
> # fails with ~no stacktrace thanks to the wrapper helper
> ---------------------------------------------------------------------------
> TypeError                                 Traceback (most recent call last)
> in ()
> ----> 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id', b='b')).show()
> TypeError: wrapper() got an unexpected keyword argument 'a'
> {code}
> *discourse*: it isn't difficult to swap the kwargs back in, allowing the UDF to be called this way, but the cols tuple that gets passed in the call method:
> {code:java}
> _to_seq(sc, cols, _to_java_column)
> {code}
> has to be in the right order based on the function's defined argument inputs, or the function will return incorrect results. So, the challenge here is to:
> (a) make sure to reconstruct the proper order of the full args/kwargs
> --> args first, and then kwargs (not in the order passed but in the order requested by the fn)
> (b) handle python2 and python3 `inspect` module inconsistencies
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
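Point (a) can be sketched independently of pyspark. The helper below is hypothetical (not pyspark code) and assumes Python 3's inspect.signature; handling point (b), python2, would need getfullargspec or the funcsigs backport:
{code}
import inspect

def ordered_cols(func, args, kwargs):
    """Rebuild the positional column order: args first, then kwargs in the
    order the function declares its parameters, not the order passed."""
    params = list(inspect.signature(func).parameters)
    remaining = dict(kwargs)
    cols = list(args)
    for name in params[len(args):]:
        if name in remaining:
            cols.append(remaining.pop(name))
    if remaining:
        raise TypeError("unexpected keyword arguments: %s" % sorted(remaining))
    return cols

def ok(a, b):
    return a * b

assert ordered_cols(ok, (), {'b': 'b', 'a': 'id'}) == ['id', 'b']
assert ordered_cols(ok, ('id',), {'b': 'b'}) == ['id', 'b']
{code}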
[jira] [Assigned] (SPARK-23583) Add interpreted execution to Invoke expression
[ https://issues.apache.org/jira/browse/SPARK-23583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23583: Assignee: Apache Spark > Add interpreted execution to Invoke expression > -- > > Key: SPARK-23583 > URL: https://issues.apache.org/jira/browse/SPARK-23583 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23583) Add interpreted execution to Invoke expression
[ https://issues.apache.org/jira/browse/SPARK-23583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394626#comment-16394626 ] Apache Spark commented on SPARK-23583: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/20797 > Add interpreted execution to Invoke expression > -- > > Key: SPARK-23583 > URL: https://issues.apache.org/jira/browse/SPARK-23583 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23583) Add interpreted execution to Invoke expression
[ https://issues.apache.org/jira/browse/SPARK-23583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23583: Assignee: (was: Apache Spark) > Add interpreted execution to Invoke expression > -- > > Key: SPARK-23583 > URL: https://issues.apache.org/jira/browse/SPARK-23583 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23650) Slow SparkR udf (dapply)
[ https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deepansh updated SPARK-23650: - Description:
For example, I am reading streams from Kafka and want to apply a model built in R to those streams. For this, I am using dapply. My code is:
iris_model <- readRDS("./iris_model.rds")
randomBr <- SparkR:::broadcast(sc, iris_model)
kafka <- read.stream("kafka", subscribe = "source", kafka.bootstrap.servers = "localhost:9092", topic = "source")
lines <- select(kafka, cast(kafka$value, "string"))
schema <- schema(lines)
df1 <- dapply(lines, function(x) {
  i_model <- SparkR:::value(randomBr)
  for (row in 1:nrow(x)) {
    y <- fromJSON(as.character(x[row, "value"]))
    y$predict <- predict(i_model, y)
    y <- toJSON(y)
    x[row, "value"] <- y
  }
  x
}, schema)
Every time Kafka streams are fetched, the dapply method creates a new runner thread and ships the variables again, which causes a huge lag (~2s for shipping the model) each time. I even tried without broadcast variables, but it takes the same time to ship the variables. Can some other technique be applied to improve its performance?
was:
For example, I am reading streams from Kafka and want to apply a model built in R to those streams. For this, I am using dapply. My code is:
iris_model <- readRDS("./iris_model.rds")
randomBr <- SparkR:::broadcast(sc, iris_model)
kafka <- read.stream("kafka", subscribe = "source", kafka.bootstrap.servers = "localhost:9092", topic = "source")
lines <- select(kafka, cast(kafka$value, "string"))
schema <- schema(lines)
df1 <- dapply(lines, function(x) {
  i_model <- SparkR:::value(randomBr)
  for (row in 1:nrow(x)) {
    y <- fromJSON(as.character(x[row, "value"]))
    y$predict <- predict(iris_model, y)
    y <- toJSON(y)
    x[row, "value"] <- y
  }
  x
}, schema)
Every time Kafka streams are fetched, the dapply method creates a new runner thread and ships the variables again, which causes a huge lag (~2s for shipping the model) each time. I even tried without broadcast variables, but it takes the same time to ship the variables. Can some other technique be applied to improve its performance?
> Slow SparkR udf (dapply) > > > Key: SPARK-23650 > URL: https://issues.apache.org/jira/browse/SPARK-23650 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, SparkR, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Deepansh >Priority: Major > >
> For example, I am reading streams from Kafka and want to apply a model built in R to those streams. For this, I am using dapply. My code is:
> iris_model <- readRDS("./iris_model.rds")
> randomBr <- SparkR:::broadcast(sc, iris_model)
> kafka <- read.stream("kafka", subscribe = "source", kafka.bootstrap.servers = "localhost:9092", topic = "source")
> lines <- select(kafka, cast(kafka$value, "string"))
> schema <- schema(lines)
> df1 <- dapply(lines, function(x) {
>   i_model <- SparkR:::value(randomBr)
>   for (row in 1:nrow(x)) {
>     y <- fromJSON(as.character(x[row, "value"]))
>     y$predict <- predict(i_model, y)
>     y <- toJSON(y)
>     x[row, "value"] <- y
>   }
>   x
> }, schema)
> Every time Kafka streams are fetched, the dapply method creates a new runner thread and ships the variables again, which causes a huge lag (~2s for shipping the model) each time. I even tried without broadcast variables, but it takes the same time to ship the variables. Can some other technique be applied to improve its performance?
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23650) Slow SparkR udf (dapply)
Deepansh created SPARK-23650: Summary: Slow SparkR udf (dapply) Key: SPARK-23650 URL: https://issues.apache.org/jira/browse/SPARK-23650 Project: Spark Issue Type: Improvement Components: Spark Shell, SparkR, Structured Streaming Affects Versions: 2.2.0 Reporter: Deepansh
For example, I am reading streams from Kafka and want to apply a model built in R to those streams. For this, I am using dapply. My code is:
iris_model <- readRDS("./iris_model.rds")
randomBr <- SparkR:::broadcast(sc, iris_model)
kafka <- read.stream("kafka", subscribe = "source", kafka.bootstrap.servers = "localhost:9092", topic = "source")
lines <- select(kafka, cast(kafka$value, "string"))
schema <- schema(lines)
df1 <- dapply(lines, function(x) {
  i_model <- SparkR:::value(randomBr)
  for (row in 1:nrow(x)) {
    y <- fromJSON(as.character(x[row, "value"]))
    y$predict <- predict(iris_model, y)
    y <- toJSON(y)
    x[row, "value"] <- y
  }
  x
}, schema)
Every time Kafka streams are fetched, the dapply method creates a new runner thread and ships the variables again, which causes a huge lag (~2s for shipping the model) each time. I even tried without broadcast variables, but it takes the same time to ship the variables. Can some other technique be applied to improve its performance? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23650) Slow SparkR udf (dapply)
[ https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deepansh updated SPARK-23650: - Description:
For example, I am reading streams from Kafka and want to apply a model built in R to those streams. For this, I am using dapply. My code is:
iris_model <- readRDS("./iris_model.rds")
randomBr <- SparkR:::broadcast(sc, iris_model)
kafka <- read.stream("kafka", subscribe = "source", kafka.bootstrap.servers = "localhost:9092", topic = "source")
lines <- select(kafka, cast(kafka$value, "string"))
schema <- schema(lines)
df1 <- dapply(lines, function(x) {
  i_model <- SparkR:::value(randomBr)
  for (row in 1:nrow(x)) {
    y <- fromJSON(as.character(x[row, "value"]))
    y$predict <- predict(iris_model, y)
    y <- toJSON(y)
    x[row, "value"] <- y
  }
  x
}, schema)
Every time Kafka streams are fetched, the dapply method creates a new runner thread and ships the variables again, which causes a huge lag (~2s for shipping the model) each time. I even tried without broadcast variables, but it takes the same time to ship the variables. Can some other technique be applied to improve its performance?
was:
For example, I am reading streams from Kafka and want to apply a model built in R to those streams. For this, I am using dapply. My code is:
iris_model <- readRDS("./iris_model.rds")
randomBr <- SparkR:::broadcast(sc, iris_model)
kafka <- read.stream("kafka", subscribe = "source", kafka.bootstrap.servers = "localhost:9092", topic = "source")
lines <- select(kafka, cast(kafka$value, "string"))
schema <- schema(lines)
df1 <- dapply(lines, function(x) {
  i_model <- SparkR:::value(randomBr)
  for (row in 1:nrow(x)) {
    y <- fromJSON(as.character(x[row, "value"]))
    y$predict <- predict(iris_model, y)
    y <- toJSON(y)
    x[row, "value"] <- y
  }
  x
}, schema)
Every time Kafka streams are fetched, the dapply method creates a new runner thread and ships the variables again, which causes a huge lag (~2s for shipping the model) each time. I even tried without broadcast variables, but it takes the same time to ship the variables. Can some other technique be applied to improve its performance?
> Slow SparkR udf (dapply) > > > Key: SPARK-23650 > URL: https://issues.apache.org/jira/browse/SPARK-23650 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, SparkR, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Deepansh >Priority: Major > >
> For example, I am reading streams from Kafka and want to apply a model built in R to those streams. For this, I am using dapply. My code is:
> iris_model <- readRDS("./iris_model.rds")
> randomBr <- SparkR:::broadcast(sc, iris_model)
> kafka <- read.stream("kafka", subscribe = "source", kafka.bootstrap.servers = "localhost:9092", topic = "source")
> lines <- select(kafka, cast(kafka$value, "string"))
> schema <- schema(lines)
> df1 <- dapply(lines, function(x) {
>   i_model <- SparkR:::value(randomBr)
>   for (row in 1:nrow(x)) {
>     y <- fromJSON(as.character(x[row, "value"]))
>     y$predict <- predict(iris_model, y)
>     y <- toJSON(y)
>     x[row, "value"] <- y
>   }
>   x
> }, schema)
> Every time Kafka streams are fetched, the dapply method creates a new runner thread and ships the variables again, which causes a huge lag (~2s for shipping the model) each time. I even tried without broadcast variables, but it takes the same time to ship the variables. Can some other technique be applied to improve its performance?
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23638) Spark on k8s: spark.kubernetes.initContainer.image has no effect
[ https://issues.apache.org/jira/browse/SPARK-23638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394615#comment-16394615 ] Anton Okolnychyi commented on SPARK-23638: -- Could you also share your spark-submit command? It looks like you are specifying a custom docker image for the init container (as ``spark.kubernetes.initContainer.image`` is different from ``spark.kubernetes.container.image``). Are you sure you need a custom docker image for the init container? In general, if you have a remote jar in --jars and specify ``spark.kubernetes.container.image``, Spark will create an init container for you and you do not need to reason about it. > Spark on k8s: spark.kubernetes.initContainer.image has no effect > > > Key: SPARK-23638 > URL: https://issues.apache.org/jira/browse/SPARK-23638 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: K8 server: Ubuntu 16.04 > Submission client: macOS Sierra 10.12.x > Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T12:22:21Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"} > Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.3", GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean", BuildDate:"2017-11-08T18:27:48Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"} >Reporter: maheshvra >Priority: Major > > Hi all - I am trying to use an initContainer to download remote dependencies. To begin with, I ran a test with an initContainer which basically runs "echo hello world". However, when I triggered the pod deployment via spark-submit, I did not see any trace of initContainer execution in my kubernetes cluster.
> {code:java}
> SPARK_DRIVER_MEMORY: 1g
> SPARK_DRIVER_CLASS: com.bigdata.App
> SPARK_DRIVER_ARGS: -c /opt/spark/work-dir/app/main/environments/int -w ./../../workflows/workflow_main.json -e prod -n features -v off
> SPARK_DRIVER_BIND_ADDRESS:
> SPARK_JAVA_OPT_0: -Dspark.submit.deployMode=cluster
> SPARK_JAVA_OPT_1: -Dspark.driver.blockManager.port=7079
> SPARK_JAVA_OPT_2: -Dspark.app.name=fg-am00-raw12
> SPARK_JAVA_OPT_3: -Dspark.kubernetes.container.image=docker.com/cmapp/fg-am00-raw:1.0.0
> SPARK_JAVA_OPT_4: -Dspark.app.id=spark-4fa9a5ce1b1d401fa9c1e413ff030d44
> SPARK_JAVA_OPT_5: -Dspark.jars=/opt/spark/jars/aws-java-sdk-1.7.4.jar,/opt/spark/jars/hadoop-aws-2.7.3.jar,/opt/spark/jars/guava-14.0.1.jar,/opt/spark/jars/SparkApp.jar,/opt/spark/jars/datacleanup-component-1.0-SNAPSHOT.jar
> SPARK_JAVA_OPT_6: -Dspark.driver.port=7078
> SPARK_JAVA_OPT_7: -Dspark.kubernetes.initContainer.image=docker.com/cmapp/custombusybox:1.0.0
> SPARK_JAVA_OPT_8: -Dspark.kubernetes.executor.podNamePrefix=fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615
> SPARK_JAVA_OPT_9: -Dspark.kubernetes.driver.pod.name=fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver
> SPARK_JAVA_OPT_10: -Dspark.driver.host=fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver-svc.experimental.svc
> SPARK_JAVA_OPT_11: -Dspark.executor.instances=5
> SPARK_JAVA_OPT_12: -Dspark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256
> SPARK_JAVA_OPT_13: -Dspark.kubernetes.namespace=experimental
> SPARK_JAVA_OPT_14: -Dspark.kubernetes.authenticate.driver.serviceAccountName=experimental-service-account
> SPARK_JAVA_OPT_15: -Dspark.master=k8s://https://bigdata
> {code}
> Further, I did not see a spec.initContainers section in the generated pod. Please see the details below:
> {code:java}
> {
>   "kind": "Pod",
>   "apiVersion": "v1",
>   "metadata": {
>     "name": "fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver",
>     "namespace": "experimental",
>     "selfLink": "/api/v1/namespaces/experimental/pods/fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver",
>     "uid": "adc5a50a-2342-11e8-87dc-12c5b3954044",
>     "resourceVersion": "299054",
>     "creationTimestamp": "2018-03-09T02:36:32Z",
>     "labels": {
>       "spark-app-selector": "spark-4fa9a5ce1b1d401fa9c1e413ff030d44",
>       "spark-role": "driver"
>     },
>     "annotations": {
>       "spark-app-name": "fg-am00-raw12"
>     }
>   },
>   "spec": {
>     "volumes": [
>       {
>         "name": "experimental-service-account-token-msmth",
>         "secret": {
>           "secretName": "experimental-service-account-token-msmth",
>           "defaultMode": 420
>         }
>       }
>     ],
>     "containers": [
>       {
>         "name": "spark-kubernetes-driver",
>         "image": "docker.com/cmapp/fg-am00-raw:1.0.0",
>         "args": [
>           "driver"
>         ],
>         "env": [
>           {
>             "name": "SPARK_DRIVER_MEMORY",
>             "value": "1g"
>           },
>           {
>             "name": "SPARK_DRIVER_CLASS",
>             "value": "com.myapp.App"
>           },
>           {
>             "name": "SPARK_DRIVER_ARGS",
>             "value": "-c
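To illustrate the suggestion in the comment above (every host, path and image name below is a placeholder): drop spark.kubernetes.initContainer.image, keep the single application image, and reference the remote dependency in --jars so Spark generates the init container itself.
{code}
bin/spark-submit \
  --master k8s://https://<api-server-host>:<port> \
  --deploy-mode cluster \
  --class com.bigdata.App \
  --conf spark.kubernetes.namespace=experimental \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=experimental-service-account \
  --conf spark.kubernetes.container.image=docker.com/cmapp/fg-am00-raw:1.0.0 \
  --jars https://repo.example.com/libs/datacleanup-component-1.0-SNAPSHOT.jar \
  local:///opt/spark/jars/SparkApp.jar
{code}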
[jira] [Comment Edited] (SPARK-10815) Public API: Streaming Sources and Sinks
[ https://issues.apache.org/jira/browse/SPARK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394583#comment-16394583 ] Richard Yu edited comment on SPARK-10815 at 3/11/18 5:42 PM: - [~c...@koeninger.org] Is exposing Sink Offsets still applicable to Spark? was (Author: yohan123): [~c...@koeninger.org] Is this still applicable to Spark? This Jira has been inactive for over a year. > Public API: Streaming Sources and Sinks > --- > > Key: SPARK-10815 > URL: https://issues.apache.org/jira/browse/SPARK-10815 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin >Priority: Critical > > The existing (in 2.0) source/sink interface for structured streaming depends > on RDDs. This dependency has two issues: > 1. The RDD interface is wide and difficult to stabilize across versions. This > is similar to point 1 in https://issues.apache.org/jira/browse/SPARK-15689. > Ideally, a source/sink implementation created for Spark 2.x should work in > Spark 10.x, assuming the JVM is still around. > 2. It is difficult to swap in/out a different execution engine. > The purpose of this ticket is to create a stable interface that addresses the > above two. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10815) Public API: Streaming Sources and Sinks
[ https://issues.apache.org/jira/browse/SPARK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394583#comment-16394583 ] Richard Yu commented on SPARK-10815: [~c...@koeninger.org] Is this still applicable to Spark? This Jira has been inactive for over a year. > Public API: Streaming Sources and Sinks > --- > > Key: SPARK-10815 > URL: https://issues.apache.org/jira/browse/SPARK-10815 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin >Priority: Critical > > The existing (in 2.0) source/sink interface for structured streaming depends > on RDDs. This dependency has two issues: > 1. The RDD interface is wide and difficult to stabilize across versions. This > is similar to point 1 in https://issues.apache.org/jira/browse/SPARK-15689. > Ideally, a source/sink implementation created for Spark 2.x should work in > Spark 10.x, assuming the JVM is still around. > 2. It is difficult to swap in/out a different execution engine. > The purpose of this ticket is to create a stable interface that addresses the > above two. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars
[ https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23649: Assignee: Apache Spark > CSV schema inferring fails on some UTF-8 chars > -- > > Key: SPARK-23649 > URL: https://issues.apache.org/jira/browse/SPARK-23649 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > Attachments: utf8xFF.csv > >
> Schema inferring of CSV files fails if the file contains a char starting with *0xFF*.
> {code:java}
> spark.read.option("header", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 63
>   at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
>   at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
> {code}
> Here is the content of the file:
> {code:java}
> hexdump -C ~/tmp/utf8xFF.csv
> 00000000  63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
> 00000010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
> 00000020  2c 34 35 36 0d                                    |,456.|
> 00000025
> {code}
> Schema inferring doesn't fail in multiline mode:
> {code}
> spark.read.option("header", "true").option("multiline", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> +-------+----+
> |channel|code|
> +-------+----+
> | United| 123|
> | ABGUN�| 456|
> +-------+----+
> {code}
> and Spark is able to read the csv file if the schema is specified:
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType().add("channel", StringType).add("code", StringType)
> spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
> {code}
> {code:java}
> +-------+----+
> |channel|code|
> +-------+----+
> | United| 123|
> | ABGUN�| 456|
> +-------+----+
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars
[ https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23649: Assignee: (was: Apache Spark) > CSV schema inferring fails on some UTF-8 chars > -- > > Key: SPARK-23649 > URL: https://issues.apache.org/jira/browse/SPARK-23649 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > Attachments: utf8xFF.csv > >
> Schema inferring of CSV files fails if the file contains a char starting with *0xFF*.
> {code:java}
> spark.read.option("header", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 63
>   at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
>   at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
> {code}
> Here is the content of the file:
> {code:java}
> hexdump -C ~/tmp/utf8xFF.csv
> 00000000  63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
> 00000010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
> 00000020  2c 34 35 36 0d                                    |,456.|
> 00000025
> {code}
> Schema inferring doesn't fail in multiline mode:
> {code}
> spark.read.option("header", "true").option("multiline", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> +-------+----+
> |channel|code|
> +-------+----+
> | United| 123|
> | ABGUN�| 456|
> +-------+----+
> {code}
> and Spark is able to read the csv file if the schema is specified:
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType().add("channel", StringType).add("code", StringType)
> spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
> {code}
> {code:java}
> +-------+----+
> |channel|code|
> +-------+----+
> | United| 123|
> | ABGUN�| 456|
> +-------+----+
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars
[ https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394582#comment-16394582 ] Apache Spark commented on SPARK-23649: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/20796 > CSV schema inferring fails on some UTF-8 chars > -- > > Key: SPARK-23649 > URL: https://issues.apache.org/jira/browse/SPARK-23649 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > Attachments: utf8xFF.csv > >
> Schema inferring of CSV files fails if the file contains a char starting with *0xFF*.
> {code:java}
> spark.read.option("header", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 63
>   at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
>   at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
> {code}
> Here is the content of the file:
> {code:java}
> hexdump -C ~/tmp/utf8xFF.csv
> 00000000  63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
> 00000010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
> 00000020  2c 34 35 36 0d                                    |,456.|
> 00000025
> {code}
> Schema inferring doesn't fail in multiline mode:
> {code}
> spark.read.option("header", "true").option("multiline", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> +-------+----+
> |channel|code|
> +-------+----+
> | United| 123|
> | ABGUN�| 456|
> +-------+----+
> {code}
> and Spark is able to read the csv file if the schema is specified:
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType().add("channel", StringType).add("code", StringType)
> spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
> {code}
> {code:java}
> +-------+----+
> |channel|code|
> +-------+----+
> | United| 123|
> | ABGUN�| 456|
> +-------+----+
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
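The stack trace points at a leading-byte width lookup that indexes past its table for the invalid leading byte 0xFF. A standalone model of the defensive behavior (a sketch, not the actual UTF8String code): clamp invalid leading bytes to a width of 1 so scanning advances instead of throwing.
{code:java}
// UTF-8 leading-byte width, tolerant of invalid leading bytes such as 0xFF.
def numBytesForFirstByte(b: Byte): Int = {
  val u = b & 0xFF
  if (u < 0x80) 1       // ASCII
  else if (u < 0xC2) 1  // continuation or overlong start: advance one byte
  else if (u < 0xE0) 2
  else if (u < 0xF0) 3
  else if (u < 0xF5) 4
  else 1                // 0xF5..0xFF are invalid: advance one byte, don't crash
}

assert(numBytesForFirstByte(0xFF.toByte) == 1) // no ArrayIndexOutOfBoundsException
{code}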
[jira] [Comment Edited] (SPARK-16859) History Server storage information is missing
[ https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394580#comment-16394580 ] Richard Yu edited comment on SPARK-16859 at 3/11/18 5:26 PM: - Is this issue not already resolved? When looking into {{JsonProtocol}}, I found that {{SparkListenerBlockUpdated}} was already included as a case in serialization. This was in spark-core_2.11. was (Author: yohan123): Is this issue not already resolved? When looking into {{JsonProtocol}}, I found that {{SparkListenerBlockUpdated}} was already included as a case in serialization. This was in Scala-2.11. > History Server storage information is missing > - > > Key: SPARK-16859 > URL: https://issues.apache.org/jira/browse/SPARK-16859 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 >Reporter: Andrei Ivanov >Priority: Major > Labels: historyserver, newbie > > It looks like the job history storage tab in the history server is broken for > completed jobs since *1.6.2*. > More specifically, it's broken since > [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845]. > I've fixed it for my installation by effectively reverting the above patch > ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]). > IMHO, the most straightforward fix would be to implement > _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_, making > sure it works from _ReplayListenerBus_. > The downside is that it will still work incorrectly with pre-patch job > histories. But then, it hasn't worked since *1.6.2* anyhow. > PS: I'd really love to have this fixed eventually. But I'm pretty new to > Apache Spark and missing hands-on Scala experience. So I'd prefer that it be > fixed by someone experienced with roadmap vision. If nobody volunteers, I'll > try to patch it myself. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
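For reference, the serialization entry the reporter proposes could look roughly like the sketch below, written against json4s (which JsonProtocol is built on); the case classes are simplified stand-ins, not Spark's real event types:
{code:java}
import org.json4s.JValue
import org.json4s.JsonDSL._

// Simplified stand-ins for the real listener event types.
case class BlockUpdatedInfo(blockId: String, memSize: Long, diskSize: Long)
case class SparkListenerBlockUpdated(blockUpdatedInfo: BlockUpdatedInfo)

def blockUpdatedToJson(event: SparkListenerBlockUpdated): JValue =
  ("Event" -> "SparkListenerBlockUpdated") ~
    ("Block Updated Info" ->
      (("Block ID" -> event.blockUpdatedInfo.blockId) ~
        ("Memory Size" -> event.blockUpdatedInfo.memSize) ~
        ("Disk Size" -> event.blockUpdatedInfo.diskSize)))
{code}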
[jira] [Updated] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars
[ https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-23649: --- Shepherd: Herman van Hovell > CSV schema inferring fails on some UTF-8 chars > -- > > Key: SPARK-23649 > URL: https://issues.apache.org/jira/browse/SPARK-23649 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > Attachments: utf8xFF.csv > >
> Schema inferring of CSV files fails if the file contains a char starting with *0xFF*.
> {code:java}
> spark.read.option("header", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 63
>   at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
>   at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
> {code}
> Here is the content of the file:
> {code:java}
> hexdump -C ~/tmp/utf8xFF.csv
> 00000000  63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
> 00000010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
> 00000020  2c 34 35 36 0d                                    |,456.|
> 00000025
> {code}
> Schema inferring doesn't fail in multiline mode:
> {code}
> spark.read.option("header", "true").option("multiline", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> +-------+----+
> |channel|code|
> +-------+----+
> | United| 123|
> | ABGUN�| 456|
> +-------+----+
> {code}
> and Spark is able to read the csv file if the schema is specified:
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType().add("channel", StringType).add("code", StringType)
> spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
> {code}
> {code:java}
> +-------+----+
> |channel|code|
> +-------+----+
> | United| 123|
> | ABGUN�| 456|
> +-------+----+
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16859) History Server storage information is missing
[ https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394580#comment-16394580 ] Richard Yu commented on SPARK-16859: Is this issue not already resolved? When looking into {{JsonProtocol}}, I found that {{SparkListenerBlockUpdated}} was already included as a case in serialization. > History Server storage information is missing > - > > Key: SPARK-16859 > URL: https://issues.apache.org/jira/browse/SPARK-16859 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 >Reporter: Andrei Ivanov >Priority: Major > Labels: historyserver, newbie > > It looks like the job history storage tab in the history server is broken for > completed jobs since *1.6.2*. > More specifically, it's broken since > [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845]. > I've fixed it for my installation by effectively reverting the above patch > ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]). > IMHO, the most straightforward fix would be to implement > _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_, making > sure it works from _ReplayListenerBus_. > The downside is that it will still work incorrectly with pre-patch job > histories. But then, it hasn't worked since *1.6.2* anyhow. > PS: I'd really love to have this fixed eventually. But I'm pretty new to > Apache Spark and missing hands-on Scala experience. So I'd prefer that it be > fixed by someone experienced with roadmap vision. If nobody volunteers, I'll > try to patch it myself. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16859) History Server storage information is missing
[ https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394580#comment-16394580 ] Richard Yu edited comment on SPARK-16859 at 3/11/18 5:25 PM: - Is this issue not already resolved? When looking into {{JsonProtocol}}, I found that {{SparkListenerBlockUpdated}} was already included as a case in serialization. This was in Scala-2.11. was (Author: yohan123): Is this issue not already resolved? When looking into {{JsonProtocol}}, I found that {{SparkListenerBlockUpdated}} was already included as a case in serialization. > History Server storage information is missing > - > > Key: SPARK-16859 > URL: https://issues.apache.org/jira/browse/SPARK-16859 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 >Reporter: Andrei Ivanov >Priority: Major > Labels: historyserver, newbie > > It looks like the job history storage tab in the history server is broken for > completed jobs since *1.6.2*. > More specifically, it's broken since > [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845]. > I've fixed it for my installation by effectively reverting the above patch > ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]). > IMHO, the most straightforward fix would be to implement > _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_, making > sure it works from _ReplayListenerBus_. > The downside is that it will still work incorrectly with pre-patch job > histories. But then, it hasn't worked since *1.6.2* anyhow. > PS: I'd really love to have this fixed eventually. But I'm pretty new to > Apache Spark and missing hands-on Scala experience. So I'd prefer that it be > fixed by someone experienced with roadmap vision. If nobody volunteers, I'll > try to patch it myself. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars
[ https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-23649: --- Attachment: utf8xFF.csv > CSV schema inferring fails on some UTF-8 chars > -- > > Key: SPARK-23649 > URL: https://issues.apache.org/jira/browse/SPARK-23649 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > Attachments: utf8xFF.csv > >
> Schema inferring of CSV files fails if the file contains a char starting with *0xFF*.
> {code:java}
> spark.read.option("header", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 63
>   at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
>   at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
> {code}
> Here is the content of the file:
> {code:java}
> hexdump -C ~/tmp/utf8xFF.csv
> 00000000  63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
> 00000010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
> 00000020  2c 34 35 36 0d                                    |,456.|
> 00000025
> {code}
> Schema inferring doesn't fail in multiline mode:
> {code}
> spark.read.option("header", "true").option("multiline", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> +-------+----+
> |channel|code|
> +-------+----+
> | United| 123|
> | ABGUN�| 456|
> +-------+----+
> {code}
> and Spark is able to read the csv file if the schema is specified:
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType().add("channel", StringType).add("code", StringType)
> spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
> {code}
> {code:java}
> +-------+----+
> |channel|code|
> +-------+----+
> | United| 123|
> | ABGUN�| 456|
> +-------+----+
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars
Maxim Gekk created SPARK-23649: -- Summary: CSV schema inferring fails on some UTF-8 chars Key: SPARK-23649 URL: https://issues.apache.org/jira/browse/SPARK-23649 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Maxim Gekk
Schema inferring of CSV files fails if the file contains a char starting with *0xFF*.
{code:java}
spark.read.option("header", "true").csv("utf8xFF.csv")
{code}
{code:java}
java.lang.ArrayIndexOutOfBoundsException: 63
  at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
  at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
{code}
Here is the content of the file:
{code:java}
hexdump -C ~/tmp/utf8xFF.csv
00000000  63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
00000010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
00000020  2c 34 35 36 0d                                    |,456.|
00000025
{code}
Schema inferring doesn't fail in multiline mode:
{code}
spark.read.option("header", "true").option("multiline", "true").csv("utf8xFF.csv")
{code}
{code:java}
+-------+----+
|channel|code|
+-------+----+
| United| 123|
| ABGUN�| 456|
+-------+----+
{code}
and Spark is able to read the csv file if the schema is specified:
{code}
import org.apache.spark.sql.types._
val schema = new StructType().add("channel", StringType).add("code", StringType)
spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
{code}
{code:java}
+-------+----+
|channel|code|
+-------+----+
| United| 123|
| ABGUN�| 456|
+-------+----+
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18134) SQL: MapType in Group BY and Joins not working
[ https://issues.apache.org/jira/browse/SPARK-18134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394571#comment-16394571 ] Sital Kedia commented on SPARK-18134: - [~hvanhovell] - What is the state of this JIRA? Do we expect to make progress on this any time soon? This is a very critical feature we are missing in Spark, which is blocking us from auto-migrating HiveQL to Spark. Please let us know if there is anything we can do to help resolve this issue. cc - [~sameerag], [~tejasp] > SQL: MapType in Group BY and Joins not working > -- > > Key: SPARK-18134 > URL: https://issues.apache.org/jira/browse/SPARK-18134 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1, > 2.1.0 >Reporter: Christian Zorneck >Priority: Major > > Since version 1.5 and issue SPARK-9415, MapTypes can no longer be used in > GROUP BY and join clauses. This makes it incompatible with HiveQL: a Hive > feature was removed from Spark, making Spark incompatible with various > HiveQL statements. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution
[ https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394550#comment-16394550 ] kevin yu commented on SPARK-19737: -- [~LANDAIS Christophe], I submitted a PR under SPARK-23486; can you try it to see if it helps? > New analysis rule for reporting unregistered functions without relying on > relation resolution > - > > Key: SPARK-19737 > URL: https://issues.apache.org/jira/browse/SPARK-19737 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Major > Fix For: 2.2.0 > > > Let's consider the following simple SQL query that references an undefined > function {{foo}} that is never registered in the function registry: > {code:sql} > SELECT foo(a) FROM t > {code} > Assuming table {{t}} is a partitioned temporary view consisting of a large > number of files stored on S3, it may take the analyzer a long time to > realize that {{foo}} is not registered yet. > The reason is that the existing analysis rule {{ResolveFunctions}} requires > all child expressions to be resolved first. Therefore, {{ResolveRelations}} > has to be executed first to resolve all columns referenced by the unresolved > function invocation. This further leads to partition discovery for {{t}}, > which may take a long time. > To address this case, we propose a new lightweight analysis rule > {{LookupFunctions}} that: > # matches all unresolved function invocations, > # looks up the function names in the function registry, and > # reports an analysis error for any unregistered function. > Since this rule doesn't actually try to resolve the unresolved functions, it > doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition > discovery. > We may put this analysis rule in a separate {{Once}} rule batch that sits > between the "Substitution" batch and the "Resolution" batch, to avoid running > it repeatedly and to make sure it gets executed before {{ResolveRelations}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
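For reference, a rough sketch of what the {{LookupFunctions}} rule described above could look like (simplified from the proposal; the catalog call and exception choice here are assumptions, not necessarily the merged implementation):
{code:scala}
import org.apache.spark.sql.catalyst.analysis.{NoSuchFunctionException, UnresolvedFunction}
import org.apache.spark.sql.catalyst.catalog.SessionCatalog
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Fail fast on unregistered functions: only the function registry is
// consulted, so relations stay unresolved and no partition discovery runs.
class LookupFunctions(catalog: SessionCatalog) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
    case f: UnresolvedFunction if !catalog.functionExists(f.name) =>
      throw new NoSuchFunctionException(
        f.name.database.getOrElse("default"), f.name.funcName)
  }
}
{code}
Because the rule only throws or returns the plan unchanged, it is safe to run in its own {{Once}} batch ahead of the "Resolution" batch, as the description suggests.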
[jira] [Assigned] (SPARK-23647) extend hint syntax to support any expression for Python
[ https://issues.apache.org/jira/browse/SPARK-23647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23647: Assignee: (was: Apache Spark) > extend hint syntax to support any expression for Python > --- > > Key: SPARK-23647 > URL: https://issues.apache.org/jira/browse/SPARK-23647 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Dylan Guedes >Priority: Major > > Relax checks in > [https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23647) extend hint syntax to support any expression for Python
[ https://issues.apache.org/jira/browse/SPARK-23647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23647: Assignee: Apache Spark > extend hint syntax to support any expression for Python > --- > > Key: SPARK-23647 > URL: https://issues.apache.org/jira/browse/SPARK-23647 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Dylan Guedes >Assignee: Apache Spark >Priority: Major > > Relax checks in > [https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23647) extend hint syntax to support any expression for Python
[ https://issues.apache.org/jira/browse/SPARK-23647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394499#comment-16394499 ] Apache Spark commented on SPARK-23647: -- User 'DylanGuedes' has created a pull request for this issue: https://github.com/apache/spark/pull/20788 > extend hint syntax to support any expression for Python > --- > > Key: SPARK-23647 > URL: https://issues.apache.org/jira/browse/SPARK-23647 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Dylan Guedes >Priority: Major > > Relax checks in > [https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23648) extend hint syntax to support any expression for R
Dylan Guedes created SPARK-23648: Summary: extend hint syntax to support any expression for R Key: SPARK-23648 URL: https://issues.apache.org/jira/browse/SPARK-23648 Project: Spark Issue Type: Sub-task Components: SparkR, SQL Affects Versions: 2.3.0, 2.2.0 Reporter: Dylan Guedes Relax checks in [https://github.com/apache/spark/blob/7f203a248f94df6183a4bc4642a3d873171fef29/R/pkg/R/DataFrame.R#L3746] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23647) extend hint syntax to support any expression for Python
Dylan Guedes created SPARK-23647: Summary: extend hint syntax to support any expression for Python Key: SPARK-23647 URL: https://issues.apache.org/jira/browse/SPARK-23647 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 2.2.0 Reporter: Dylan Guedes Relax checks in [https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
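For context on what "relax checks" means in these sub-tasks: if I read the Scala side correctly, {{Dataset.hint(name: String, parameters: Any*)}} already accepts arbitrary parameter values, while the Python and R bindings validate parameter types before forwarding them to the JVM. A small Scala-side illustration (assuming {{df}} is any DataFrame; "myHint" is a made-up hint name, and unrecognized hints are dropped with a warning rather than failing):
{code:scala}
val hinted = df.hint("broadcast")            // parameter-less hint, works today
val custom = df.hint("myHint", 100, "foo")   // arbitrary parameters on the Scala side
{code}
The proposal is to loosen the client-side type checks (dataframe.py#L422 and DataFrame.R#L3746) so the bindings can pass the same range of parameter types through to the JVM.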
[jira] [Resolved] (SPARK-23646) pyspark DataFrameWriter ignores customized settings?
[ https://issues.apache.org/jira/browse/SPARK-23646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23646. -- Resolution: Invalid Let me leave this resolved, but please feel free to reopen when it becomes clear that it's an issue.
> pyspark DataFrameWriter ignores customized settings?
> -----------------------------------------------------
>
> Key: SPARK-23646
> URL: https://issues.apache.org/jira/browse/SPARK-23646
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.2.1
> Reporter: Chuan-Heng Hsiao
> Priority: Major
>
> I am using spark-2.2.1-bin-hadoop2.7 in standalone mode (Python 3.5.2 on Ubuntu 16.04).
> I intended to have a DataFrame write to HDFS with a customized block size but failed.
> However, the corresponding RDD can successfully write with the customized block size.
> The following is the test code (dfs.namenode.fs-limits.min-block-size has been set to 131072 in HDFS):
> {code:python}
> ##################
> # init
> ##################
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
>
> import hdfs
> from hdfs import InsecureClient
> import os
>
> import numpy as np
> import pandas as pd
> import logging
>
> os.environ['SPARK_HOME'] = '/opt/spark-2.2.1-bin-hadoop2.7'
>
> block_size = 512 * 1024
>
> conf = SparkConf().setAppName("DCSSpark") \
>     .setMaster("spark://spark1:7077") \
>     .set('spark.cores.max', 20) \
>     .set("spark.executor.cores", 10) \
>     .set("spark.executor.memory", "10g") \
>     .set("spark.hadoop.dfs.blocksize", str(block_size)) \
>     .set("spark.hadoop.dfs.block.size", str(block_size))
>
> spark = SparkSession.builder.config(conf=conf).getOrCreate()
> spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.blocksize", block_size)
> spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.block.size", block_size)
>
> ##################
> # main
> ##################
> # create DataFrame
> df_txt = spark.createDataFrame([{'temp': "hello"}, {'temp': "world"}, {'temp': "!"}])
>
> # save using DataFrameWriter -> resulting block size: 128MB
> df_txt.write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_with_df')
>
> # save using rdd -> resulting block size: 512KB
> client = InsecureClient('http://spark1:50070')
> client.delete('/tmp/temp_with_rrd', recursive=True)
> df_txt.rdd.saveAsTextFile('hdfs://spark1/tmp/temp_with_rrd')
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
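One avenue worth checking in reports like this (an assumption about the likely cause, not a confirmed diagnosis): for Parquet output the on-disk layout is governed by {{parquet.block.size}} (row-group size) in addition to {{dfs.blocksize}}, and for file-based sources Spark also merges {{DataFrameWriter.option(...)}} entries into the Hadoop configuration used for that particular write. A sketch in Scala, with the same option keys usable from the pyspark writer ({{df}} stands for the DataFrame from the report; whether these keys take effect on a given cluster should be verified):
{code:scala}
// Sketch: pass the settings per write instead of (or in addition to) the
// session-wide Hadoop configuration. Path and size mirror the report above.
val blockSize = 512 * 1024

df.write
  .option("dfs.blocksize", blockSize.toString)       // HDFS block size for the new files
  .option("parquet.block.size", blockSize.toString)  // Parquet row-group size
  .mode("overwrite")
  .parquet("hdfs://spark1/tmp/temp_with_df")
{code}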