[jira] [Commented] (SPARK-23588) Add interpreted execution for CatalystToExternalMap expression

2018-03-11 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394827#comment-16394827
 ] 

Takeshi Yamamuro commented on SPARK-23588:
--

I'll make a pr later: 
https://github.com/apache/spark/compare/master...maropu:SPARK-23588

> Add interpreted execution for CatalystToExternalMap expression
> --
>
> Key: SPARK-23588
> URL: https://issues.apache.org/jira/browse/SPARK-23588
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Takeshi Yamamuro
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18134) SQL: MapType in Group BY and Joins not working

2018-03-11 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394787#comment-16394787
 ] 

Takeshi Yamamuro commented on SPARK-18134:
--

Sorry to interrupt, but the currently proposed PR 
(https://github.com/apache/spark/pull/19330#issuecomment-356787180) touches the 
stable interface of MapType, so I think this ticket is effectively blocked until 
the next major release.

> SQL: MapType in Group BY and Joins not working
> --
>
> Key: SPARK-18134
> URL: https://issues.apache.org/jira/browse/SPARK-18134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1, 
> 2.1.0
>Reporter: Christian Zorneck
>Priority: Major
>
> Since version 1.5 and issue SPARK-9415, MapTypes can no longer be used in 
> GROUP BY and join clauses. This removed a Hive feature from Spark and makes 
> Spark incompatible with various HiveQL statements.
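For illustration only (this snippet is not from the thread), a minimal PySpark 
example of the behavior being described; the exact error text may vary by version:
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A DataFrame with a MapType column.
df = spark.createDataFrame([({"a": 1},), ({"b": 2},)], ["m"])

# Allowed in HiveQL, but Spark raises an AnalysisException because
# map<string,bigint> is not an orderable data type for grouping.
df.groupBy("m").count().show()
{code}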



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23627) Provide isEmpty() function in DataSet

2018-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394782#comment-16394782
 ] 

Apache Spark commented on SPARK-23627:
--

User 'goungoun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20800

> Provide isEmpty() function in DataSet
> -
>
> Key: SPARK-23627
> URL: https://issues.apache.org/jira/browse/SPARK-23627
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Goun Na
>Priority: Trivial
>
> Like rdd.isEmpty, adding isEmpty to DataSet would be useful. 
> Some code without isEmpty:
> {code:java}
> if (df.count == 0) { do_something }{code}
>  Some people add limit 1 for performance reasons:
> {code:java}
> if (df.limit(1).rdd.isEmpty) { do_something }
> if (df.rdd.take(1).isEmpty) { do_something }{code}
>  
> If isEmpty is provided, the code will be perfectly clean:
> {code:java}
> if (df.isEmpty) { do_something }{code}
>  
>  
>  
>  
>  
>  
>  
>  
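For reference, a minimal sketch (an assumption on my part, not the actual patch in 
the PR above) of what such a helper can look like from the Python side, avoiding a 
full count:
{code:java}
def is_empty(df):
    """Hypothetical helper: True if the DataFrame/Dataset has no rows.

    take(1) lets Spark stop after the first row instead of scanning
    the whole dataset the way count() does.
    """
    return len(df.take(1)) == 0

# usage:
# if is_empty(df):
#     do_something()
{code}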



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23634) AttributeReferences may be too conservative wrt nullability after optimization

2018-03-11 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394777#comment-16394777
 ] 

Takeshi Yamamuro commented on SPARK-23634:
--

This ticket is probably a duplicate of 
https://issues.apache.org/jira/browse/SPARK-21351?
In the previous discussion (https://github.com/apache/spark/pull/18576), adding 
a new rule for this purpose seemed somewhat intrusive, so I think we need to 
look for other approaches to solve this.

> AttributeReferences may be too conservative wrt nullability after optimization
> --
>
> Key: SPARK-23634
> URL: https://issues.apache.org/jira/browse/SPARK-23634
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Henry Robinson
>Priority: Minor
>
> An {{AttributeReference}} effectively caches the nullability of its referent 
> when it is created. Some optimization rules can transform a nullable 
> attribute into a non-nullable one, but the references to it are not updated. 
> We could add a transformation rule that visits every {{AttributeReference}} 
> and fixes its nullability after optimization. 
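As an illustration (mine, not taken from the ticket), the conservatism is easy to 
see from PySpark: a filter that removes all nulls does not tighten the attribute's 
nullability:
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()

# x is nullable because the when(...) has no otherwise(...).
df = spark.range(5).withColumn("x", when(col("id") > 2, col("id")))
filtered = df.filter(col("x").isNotNull())

# The filter guarantees x is non-null, yet the schema (backed by the
# AttributeReference) still reports it as nullable.
print([f.nullable for f in filtered.schema.fields if f.name == "x"])  # [True]
{code}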



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23635) Spark executor env variable is overwritten by same name AM env variable

2018-03-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23635:


Assignee: Apache Spark

> Spark executor env variable is overwritten by same name AM env variable
> ---
>
> Key: SPARK-23635
> URL: https://issues.apache.org/jira/browse/SPARK-23635
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Minor
>
> In the current Spark on YARN code, the AM always copies and overwrites its env 
> variables onto executors, so we cannot set different values for executors.
> To reproduce the issue, start spark-shell like:
> {code:java}
> ./bin/spark-shell --master yarn-client --conf 
> spark.executorEnv.SPARK_ABC=executor_val --conf 
> spark.yarn.appMasterEnv.SPARK_ABC=am_val
> {code}
> Then check the executor env variables with:
> {code:java}
> sc.parallelize(1 to 1).flatMap { i => sys.env.toSeq 
> }.collect.foreach(println)
> {code}
> You will always get {{am_val}} instead of {{executor_val}}. So we should not 
> let the AM overwrite explicitly set executor env variables.
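For completeness, a PySpark equivalent of the check above (my own sketch, not part 
of the ticket):
{code:java}
import os

# Each task reports the value it actually sees on the executor; with the
# current behavior this prints ['am_val'] rather than ['executor_val'].
print(sc.range(0, 4).map(lambda _: os.environ.get("SPARK_ABC")).distinct().collect())
{code}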



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23635) Spark executor env variable is overwritten by same name AM env variable

2018-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394767#comment-16394767
 ] 

Apache Spark commented on SPARK-23635:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/20799

> Spark executor env variable is overwritten by same name AM env variable
> ---
>
> Key: SPARK-23635
> URL: https://issues.apache.org/jira/browse/SPARK-23635
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Minor
>
> In the current Spark on YARN code, the AM always copies and overwrites its env 
> variables onto executors, so we cannot set different values for executors.
> To reproduce the issue, start spark-shell like:
> {code:java}
> ./bin/spark-shell --master yarn-client --conf 
> spark.executorEnv.SPARK_ABC=executor_val --conf 
> spark.yarn.appMasterEnv.SPARK_ABC=am_val
> {code}
> Then check the executor env variables with:
> {code:java}
> sc.parallelize(1 to 1).flatMap { i => sys.env.toSeq 
> }.collect.foreach(println)
> {code}
> You will always get {{am_val}} instead of {{executor_val}}. So we should not 
> let the AM overwrite explicitly set executor env variables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23635) Spark executor env variable is overwritten by same name AM env variable

2018-03-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23635:


Assignee: (was: Apache Spark)

> Spark executor env variable is overwritten by same name AM env variable
> ---
>
> Key: SPARK-23635
> URL: https://issues.apache.org/jira/browse/SPARK-23635
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Minor
>
> In the current Spark on YARN code, the AM always copies and overwrites its env 
> variables onto executors, so we cannot set different values for executors.
> To reproduce the issue, start spark-shell like:
> {code:java}
> ./bin/spark-shell --master yarn-client --conf 
> spark.executorEnv.SPARK_ABC=executor_val --conf 
> spark.yarn.appMasterEnv.SPARK_ABC=am_val
> {code}
> Then check the executor env variables with:
> {code:java}
> sc.parallelize(1 to 1).flatMap { i => sys.env.toSeq 
> }.collect.foreach(println)
> {code}
> You will always get {{am_val}} instead of {{executor_val}}. So we should not 
> let the AM overwrite explicitly set executor env variables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23645:


Assignee: Apache Spark

> pandas_udf can not be called with keyword arguments
> ---
>
> Key: SPARK-23645
> URL: https://issues.apache.org/jira/browse/SPARK-23645
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141
>Reporter: Stu (Michael Stewart)
>Assignee: Apache Spark
>Priority: Minor
>
> pandas_udf (and likely all Python UDFs) does not accept keyword arguments, because 
> the `UserDefinedFunction` class in `pyspark/sql/udf.py` defines __call__, and the 
> wrapper utility method around it, to accept only positional args and not kwargs:
> @ line 168:
> {code:java}
> ...
> def __call__(self, *cols):
>     judf = self._judf
>     sc = SparkContext._active_spark_context
>     return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
>
> # This function is for improving the online help system in the interactive interpreter.
> # For example, the built-in help / pydoc.help. It wraps the UDF with the docstring and
> # argument annotation. (See: SPARK-19161)
> def _wrapped(self):
>     """
>     Wrap this udf with a function and attach docstring from func
>     """
>     # It is possible for a callable instance without __name__ attribute or/and
>     # __module__ attribute to be wrapped here. For example, functools.partial. In this case,
>     # we should avoid wrapping the attributes from the wrapped function to the wrapper
>     # function. So, we take out these attribute names from the default names to set and
>     # then manually assign it after being wrapped.
>     assignments = tuple(
>         a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != '__module__')
>
>     @functools.wraps(self.func, assigned=assignments)
>     def wrapper(*args):
>         return self(*args)
> ...{code}
> as seen in:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
> def ok(a,b): return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # no problems
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show()  # fails with ~no stacktrace thanks to the wrapper helper
> ---
> TypeError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show()
> TypeError: wrapper() got an unexpected keyword argument 'a'{code}
>  
> *discourse*: it isn't difficult to swap the kwargs back in, allowing the UDF to be 
> called that way, but the cols tuple that gets passed to the call method:
> {code:java}
> _to_seq(sc, cols, _to_java_column){code}
> has to be in the right order based on the function's declared arguments, or the 
> UDF will return incorrect results. So the challenge here is to:
> (a) reconstruct the proper order of the full args/kwargs
> --> args first, and then kwargs (not in the order passed but in the order 
> declared by the fn)
> (b) handle Python 2 and Python 3 `inspect` module inconsistencies 
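A rough sketch of the reordering idea in (a) (my own illustration using Python 3's 
inspect.signature, not the actual proposed patch; the helper name is hypothetical):
{code:java}
import functools
import inspect

def allow_kwargs(func, positional_call):
    """Hypothetical wrapper: accept kwargs, reorder them into the positional
    order declared by func, then delegate to the positional-only call."""
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # bind() validates the arguments and orders them as declared by func.
        bound = sig.bind(*args, **kwargs)
        return positional_call(*bound.args)

    return wrapper

# usage sketch:
# udf_call = pandas_udf(f=ok, returnType='bigint')
# df.withColumn('ok', allow_kwargs(ok, udf_call)(a='id', b='b'))
{code}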



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23645:


Assignee: (was: Apache Spark)

> pandas_udf can not be called with keyword arguments
> ---
>
> Key: SPARK-23645
> URL: https://issues.apache.org/jira/browse/SPARK-23645
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141
>Reporter: Stu (Michael Stewart)
>Priority: Minor
>
> pandas_udf (and likely all Python UDFs) does not accept keyword arguments, because 
> the `UserDefinedFunction` class in `pyspark/sql/udf.py` defines __call__, and the 
> wrapper utility method around it, to accept only positional args and not kwargs:
> @ line 168:
> {code:java}
> ...
> def __call__(self, *cols):
>     judf = self._judf
>     sc = SparkContext._active_spark_context
>     return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
>
> # This function is for improving the online help system in the interactive interpreter.
> # For example, the built-in help / pydoc.help. It wraps the UDF with the docstring and
> # argument annotation. (See: SPARK-19161)
> def _wrapped(self):
>     """
>     Wrap this udf with a function and attach docstring from func
>     """
>     # It is possible for a callable instance without __name__ attribute or/and
>     # __module__ attribute to be wrapped here. For example, functools.partial. In this case,
>     # we should avoid wrapping the attributes from the wrapped function to the wrapper
>     # function. So, we take out these attribute names from the default names to set and
>     # then manually assign it after being wrapped.
>     assignments = tuple(
>         a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != '__module__')
>
>     @functools.wraps(self.func, assigned=assignments)
>     def wrapper(*args):
>         return self(*args)
> ...{code}
> as seen in:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
> def ok(a,b): return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # no problems
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show()  # fails with ~no stacktrace thanks to the wrapper helper
> ---
> TypeError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show()
> TypeError: wrapper() got an unexpected keyword argument 'a'{code}
>  
> *discourse*: it isn't difficult to swap the kwargs back in, allowing the UDF to be 
> called that way, but the cols tuple that gets passed to the call method:
> {code:java}
> _to_seq(sc, cols, _to_java_column){code}
> has to be in the right order based on the function's declared arguments, or the 
> UDF will return incorrect results. So the challenge here is to:
> (a) reconstruct the proper order of the full args/kwargs
> --> args first, and then kwargs (not in the order passed but in the order 
> declared by the fn)
> (b) handle Python 2 and Python 3 `inspect` module inconsistencies 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394639#comment-16394639
 ] 

Apache Spark commented on SPARK-23645:
--

User 'mstewart141' has created a pull request for this issue:
https://github.com/apache/spark/pull/20798

> pandas_udf can not be called with keyword arguments
> ---
>
> Key: SPARK-23645
> URL: https://issues.apache.org/jira/browse/SPARK-23645
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141
>Reporter: Stu (Michael Stewart)
>Priority: Minor
>
> pandas_udf (and likely all Python UDFs) does not accept keyword arguments, because 
> the `UserDefinedFunction` class in `pyspark/sql/udf.py` defines __call__, and the 
> wrapper utility method around it, to accept only positional args and not kwargs:
> @ line 168:
> {code:java}
> ...
> def __call__(self, *cols):
>     judf = self._judf
>     sc = SparkContext._active_spark_context
>     return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
>
> # This function is for improving the online help system in the interactive interpreter.
> # For example, the built-in help / pydoc.help. It wraps the UDF with the docstring and
> # argument annotation. (See: SPARK-19161)
> def _wrapped(self):
>     """
>     Wrap this udf with a function and attach docstring from func
>     """
>     # It is possible for a callable instance without __name__ attribute or/and
>     # __module__ attribute to be wrapped here. For example, functools.partial. In this case,
>     # we should avoid wrapping the attributes from the wrapped function to the wrapper
>     # function. So, we take out these attribute names from the default names to set and
>     # then manually assign it after being wrapped.
>     assignments = tuple(
>         a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != '__module__')
>
>     @functools.wraps(self.func, assigned=assignments)
>     def wrapper(*args):
>         return self(*args)
> ...{code}
> as seen in:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
> def ok(a,b): return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # no problems
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show()  # fails with ~no stacktrace thanks to the wrapper helper
> ---
> TypeError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show()
> TypeError: wrapper() got an unexpected keyword argument 'a'{code}
>  
> *discourse*: it isn't difficult to swap the kwargs back in, allowing the UDF to be 
> called that way, but the cols tuple that gets passed to the call method:
> {code:java}
> _to_seq(sc, cols, _to_java_column){code}
> has to be in the right order based on the function's declared arguments, or the 
> UDF will return incorrect results. So the challenge here is to:
> (a) reconstruct the proper order of the full args/kwargs
> --> args first, and then kwargs (not in the order passed but in the order 
> declared by the fn)
> (b) handle Python 2 and Python 3 `inspect` module inconsistencies 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23583) Add interpreted execution to Invoke expression

2018-03-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23583:


Assignee: Apache Spark

> Add interpreted execution to Invoke expression
> --
>
> Key: SPARK-23583
> URL: https://issues.apache.org/jira/browse/SPARK-23583
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23583) Add interpreted execution to Invoke expression

2018-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394626#comment-16394626
 ] 

Apache Spark commented on SPARK-23583:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20797

> Add interpreted execution to Invoke expression
> --
>
> Key: SPARK-23583
> URL: https://issues.apache.org/jira/browse/SPARK-23583
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23583) Add interpreted execution to Invoke expression

2018-03-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23583:


Assignee: (was: Apache Spark)

> Add interpreted execution to Invoke expression
> --
>
> Key: SPARK-23583
> URL: https://issues.apache.org/jira/browse/SPARK-23583
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23650) Slow SparkR udf (dapply)

2018-03-11 Thread Deepansh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepansh updated SPARK-23650:
-
Description: 
For example, I am getting streams from Kafka and I want to apply a model built 
in R to those streams. For this, I am using dapply.

My code is:

iris_model <- readRDS("./iris_model.rds")

randomBr <- SparkR:::broadcast(sc, iris_model)

kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = 
"localhost:9092", topic = "source")
lines<- select(kafka, cast(kafka$value, "string"))
schema<-schema(lines)

df1<-dapply(lines,function(x){
i_model<-SparkR:::value(randomMatBr)
for (row in 1:nrow(x)) {
y<-fromJSON(as.character(x[row,"value"]))
y$predict=predict(i_model,y)
y<-toJSON(y)
x[row,"value"] = y
}

x
},schema)

Every time Kafka streams are fetched, the dapply method creates a new runner 
thread and ships the variables again, which causes a huge lag (~2s for shipping 
the model) each time. I even tried without broadcast variables, but it takes the 
same time to ship the variables. Can some other technique be applied to improve 
its performance?

  was:
For example, I am getting streams from Kafka and I want to apply a model built 
in R to those streams. For this, I am using dapply.

My code is:

iris_model <- readRDS("./iris_model.rds")

randomBr <- SparkR:::broadcast(sc, iris_model)

kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = 
"localhost:9092", topic = "source")
lines<- select(kafka, cast(kafka$value, "string"))
schema<-schema(lines)

df1<-dapply(lines,function(x){
i_model<-SparkR:::value(randomMatBr)
for (row in 1:nrow(x)) {
y<-fromJSON(as.character(x[row,"value"]))
y$predict=predict(iris_model,y)
y<-toJSON(y)
x[row,"value"] = y
}

x
},schema)

Every time Kafka streams are fetched, the dapply method creates a new runner 
thread and ships the variables again, which causes a huge lag (~2s for shipping 
the model) each time. I even tried without broadcast variables, but it takes the 
same time to ship the variables. Can some other technique be applied to improve 
its performance?


> Slow SparkR udf (dapply)
> 
>
> Key: SPARK-23650
> URL: https://issues.apache.org/jira/browse/SPARK-23650
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Deepansh
>Priority: Major
>
> For example, I am getting streams from Kafka and I want to apply a model 
> built in R to those streams. For this, I am using dapply.
> My code is:
> iris_model <- readRDS("./iris_model.rds")
> randomBr <- SparkR:::broadcast(sc, iris_model)
> kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = 
> "localhost:9092", topic = "source")
> lines<- select(kafka, cast(kafka$value, "string"))
> schema<-schema(lines)
> df1<-dapply(lines,function(x){
> i_model<-SparkR:::value(randomMatBr)
> for (row in 1:nrow(x)) {
> y<-fromJSON(as.character(x[row,"value"]))
> y$predict=predict(i_model,y)
> y<-toJSON(y)
> x[row,"value"] = y
> }
> x
> },schema)
> Every time Kafka streams are fetched, the dapply method creates a new runner 
> thread and ships the variables again, which causes a huge lag (~2s for 
> shipping the model) each time. I even tried without broadcast variables, but 
> it takes the same time to ship the variables. Can some other technique be 
> applied to improve its performance?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23650) Slow SparkR udf (dapply)

2018-03-11 Thread Deepansh (JIRA)
Deepansh created SPARK-23650:


 Summary: Slow SparkR udf (dapply)
 Key: SPARK-23650
 URL: https://issues.apache.org/jira/browse/SPARK-23650
 Project: Spark
  Issue Type: Improvement
  Components: Spark Shell, SparkR, Structured Streaming
Affects Versions: 2.2.0
Reporter: Deepansh


For example, I am getting streams from Kafka and I want to apply a model built 
in R to those streams. For this, I am using dapply.

My code is:

iris_model <- readRDS("./iris_model.rds")

randomBr <- SparkR:::broadcast(sc, iris_model)

kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = 
"localhost:9092", topic = "source")
lines<- select(kafka, cast(kafka$value, "string"))
schema<-schema(lines)

df1<-dapply(lines,function(x){
i_model<-SparkR:::value(randomMatBr)
for (row in 1:nrow(x)) {
y<-fromJSON(as.character(x[row,"value"]))
y$predict=predict(iris_model,y)
y<-toJSON(y)
x[row,"value"] = y
}
x
},schema)

Every time Kafka streams are fetched, the dapply method creates a new runner 
thread and ships the variables again, which causes a huge lag (~2s for shipping 
the model) each time. I even tried without broadcast variables, but it takes the 
same time to ship the variables. Can some other technique be applied to improve 
its performance?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23650) Slow SparkR udf (dapply)

2018-03-11 Thread Deepansh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepansh updated SPARK-23650:
-
Description: 
For example, I am getting streams from Kafka and I want to apply a model built 
in R to those streams. For this, I am using dapply.

My code is:

iris_model <- readRDS("./iris_model.rds")

randomBr <- SparkR:::broadcast(sc, iris_model)

kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = 
"localhost:9092", topic = "source")
lines<- select(kafka, cast(kafka$value, "string"))
schema<-schema(lines)

df1<-dapply(lines,function(x){
i_model<-SparkR:::value(randomMatBr)
for (row in 1:nrow(x)) {
y<-fromJSON(as.character(x[row,"value"]))
y$predict=predict(iris_model,y)
y<-toJSON(y)
x[row,"value"] = y
}

x
},schema)

Every time Kafka streams are fetched, the dapply method creates a new runner 
thread and ships the variables again, which causes a huge lag (~2s for shipping 
the model) each time. I even tried without broadcast variables, but it takes the 
same time to ship the variables. Can some other technique be applied to improve 
its performance?

  was:
For example, I am getting streams from Kafka and I want to apply a model built 
in R to those streams. For this, I am using dapply.

My code is:

iris_model <- readRDS("./iris_model.rds")

randomBr <- SparkR:::broadcast(sc, iris_model)

kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = 
"localhost:9092", topic = "source")
lines<- select(kafka, cast(kafka$value, "string"))
schema<-schema(lines)

df1<-dapply(lines,function(x){
i_model<-SparkR:::value(randomMatBr)
for (row in 1:nrow(x)) {
y<-fromJSON(as.character(x[row,"value"]))
y$predict=predict(iris_model,y)
y<-toJSON(y)
x[row,"value"] = y
}
x
},schema)

Every time Kafka streams are fetched, the dapply method creates a new runner 
thread and ships the variables again, which causes a huge lag (~2s for shipping 
the model) each time. I even tried without broadcast variables, but it takes the 
same time to ship the variables. Can some other technique be applied to improve 
its performance?


> Slow SparkR udf (dapply)
> 
>
> Key: SPARK-23650
> URL: https://issues.apache.org/jira/browse/SPARK-23650
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Deepansh
>Priority: Major
>
> For example, I am getting streams from Kafka and I want to apply a model 
> built in R to those streams. For this, I am using dapply.
> My code is:
> iris_model <- readRDS("./iris_model.rds")
> randomBr <- SparkR:::broadcast(sc, iris_model)
> kafka <- read.stream("kafka",subscribe = "source", kafka.bootstrap.servers = 
> "localhost:9092", topic = "source")
> lines<- select(kafka, cast(kafka$value, "string"))
> schema<-schema(lines)
> df1<-dapply(lines,function(x){
> i_model<-SparkR:::value(randomMatBr)
> for (row in 1:nrow(x)) {
> y<-fromJSON(as.character(x[row,"value"]))
> y$predict=predict(iris_model,y)
> y<-toJSON(y)
> x[row,"value"] = y
> }
> x
> },schema)
> Every time Kafka streams are fetched, the dapply method creates a new runner 
> thread and ships the variables again, which causes a huge lag (~2s for 
> shipping the model) each time. I even tried without broadcast variables, but 
> it takes the same time to ship the variables. Can some other technique be 
> applied to improve its performance?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23638) Spark on k8s: spark.kubernetes.initContainer.image has no effect

2018-03-11 Thread Anton Okolnychyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394615#comment-16394615
 ] 

Anton Okolnychyi commented on SPARK-23638:
--

Could you also share your spark-submit command? It looks like you are specifying 
a custom docker image for the init container (as 
``spark.kubernetes.initContainer.image`` is different from 
``spark.kubernetes.container.image``). Are you sure you need a custom docker 
image for the init container?

In general, if you have a remote jar in --jars and specify 
``spark.kubernetes.container.image``, Spark will create an init container for 
you and you do not need to reason about it.

> Spark on k8s: spark.kubernetes.initContainer.image has no effect
> 
>
> Key: SPARK-23638
> URL: https://issues.apache.org/jira/browse/SPARK-23638
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: K8 server: Ubuntu 16.04
> Submission client: macOS Sierra 10.12.x
> Client Version: version.Info\{Major:"1", Minor:"9", GitVersion:"v1.9.3", 
> GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", 
> BuildDate:"2018-02-07T12:22:21Z", GoVersion:"go1.9.2", Compiler:"gc", 
> Platform:"darwin/amd64"}
> Server Version: version.Info\{Major:"1", Minor:"8", GitVersion:"v1.8.3", 
> GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean", 
> BuildDate:"2017-11-08T18:27:48Z", GoVersion:"go1.8.3", Compiler:"gc", 
> Platform:"linux/amd64"}
>Reporter: maheshvra
>Priority: Major
>
> Hi all - I am trying to use initContainer to download remote dependencies. To 
> begin with, I ran a test with an initContainer that basically does "echo hello 
> world". However, when I triggered the pod deployment via spark-submit, I did 
> not see any trace of initContainer execution in my kubernetes cluster.
>  
> {code:java}
> SPARK_DRIVER_MEMORY: 1g 
> SPARK_DRIVER_CLASS: com.bigdata.App SPARK_DRIVER_ARGS: -c 
> /opt/spark/work-dir/app/main/environments/int -w 
> ./../../workflows/workflow_main.json -e prod -n features -v off 
> SPARK_DRIVER_BIND_ADDRESS:  
> SPARK_JAVA_OPT_0: -Dspark.submit.deployMode=cluster 
> SPARK_JAVA_OPT_1: -Dspark.driver.blockManager.port=7079 
> SPARK_JAVA_OPT_2: -Dspark.app.name=fg-am00-raw12 
> SPARK_JAVA_OPT_3: 
> -Dspark.kubernetes.container.image=docker.com/cmapp/fg-am00-raw:1.0.0 
> SPARK_JAVA_OPT_4: -Dspark.app.id=spark-4fa9a5ce1b1d401fa9c1e413ff030d44 
> SPARK_JAVA_OPT_5: 
> -Dspark.jars=/opt/spark/jars/aws-java-sdk-1.7.4.jar,/opt/spark/jars/hadoop-aws-2.7.3.jar,/opt/spark/jars/guava-14.0.1.jar,/opt/spark/jars/SparkApp.jar,/opt/spark/jars/datacleanup-component-1.0-SNAPSHOT.jar
>  
> SPARK_JAVA_OPT_6: -Dspark.driver.port=7078 
> SPARK_JAVA_OPT_7: 
> -Dspark.kubernetes.initContainer.image=docker.com/cmapp/custombusybox:1.0.0 
> SPARK_JAVA_OPT_8: 
> -Dspark.kubernetes.executor.podNamePrefix=fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615
>  
> SPARK_JAVA_OPT_9: 
> -Dspark.kubernetes.driver.pod.name=fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver
>  
> SPARK_JAVA_OPT_10: 
> -Dspark.driver.host=fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver-svc.experimental.svc
>  SPARK_JAVA_OPT_11: -Dspark.executor.instances=5 
> SPARK_JAVA_OPT_12: 
> -Dspark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256 
> SPARK_JAVA_OPT_13: -Dspark.kubernetes.namespace=experimental 
> SPARK_JAVA_OPT_14: 
> -Dspark.kubernetes.authenticate.driver.serviceAccountName=experimental-service-account
>  SPARK_JAVA_OPT_15: -Dspark.master=k8s://https://bigdata
> {code}
>  
> Further, I did not see spec.initContainers section in the generated pod. 
> Please see the details below
>  
> {code:java}
>  
> {
> "kind": "Pod",
> "apiVersion": "v1",
> "metadata": {
> "name": "fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver",
> "namespace": "experimental",
> "selfLink": 
> "/api/v1/namespaces/experimental/pods/fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver",
> "uid": "adc5a50a-2342-11e8-87dc-12c5b3954044",
> "resourceVersion": "299054",
> "creationTimestamp": "2018-03-09T02:36:32Z",
> "labels": {
> "spark-app-selector": "spark-4fa9a5ce1b1d401fa9c1e413ff030d44",
> "spark-role": "driver"
> },
> "annotations": {
> "spark-app-name": "fg-am00-raw12"
> }
> },
> "spec": {
> "volumes": [
> {
> "name": "experimental-service-account-token-msmth",
> "secret": {
> "secretName": "experimental-service-account-token-msmth",
> "defaultMode": 420
> }
> }
> ],
> "containers": [
> {
> "name": "spark-kubernetes-driver",
> "image": "docker.com/cmapp/fg-am00-raw:1.0.0",
> "args": [
> "driver"
> ],
> "env": [
> {
> "name": "SPARK_DRIVER_MEMORY",
> "value": "1g"
> },
> {
> "name": "SPARK_DRIVER_CLASS",
> "value": "com.myapp.App"
> },
> {
> "name": "SPARK_DRIVER_ARGS",
> "value": "-c 

[jira] [Comment Edited] (SPARK-10815) Public API: Streaming Sources and Sinks

2018-03-11 Thread Richard Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394583#comment-16394583
 ] 

Richard Yu edited comment on SPARK-10815 at 3/11/18 5:42 PM:
-

[~c...@koeninger.org] Is exposing Sink Offsets still applicable to Spark?


was (Author: yohan123):
[~c...@koeninger.org] Is this still applicable to Spark? This Jira has been 
inactive for over a year.

> Public API: Streaming Sources and Sinks
> ---
>
> Key: SPARK-10815
> URL: https://issues.apache.org/jira/browse/SPARK-10815
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Priority: Critical
>
> The existing (in 2.0) source/sink interface for structured streaming depends 
> on RDDs. This dependency has two issues:
> 1. The RDD interface is wide and difficult to stabilize across versions. This 
> is similar to point 1 in https://issues.apache.org/jira/browse/SPARK-15689. 
> Ideally, a source/sink implementation created for Spark 2.x should work in 
> Spark 10.x, assuming the JVM is still around.
> 2. It is difficult to swap in/out a different execution engine.
> The purpose of this ticket is to create a stable interface that addresses the 
> above two.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10815) Public API: Streaming Sources and Sinks

2018-03-11 Thread Richard Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394583#comment-16394583
 ] 

Richard Yu commented on SPARK-10815:


[~c...@koeninger.org] Is this still applicable to Spark? This Jira has been 
inactive for over a year.

> Public API: Streaming Sources and Sinks
> ---
>
> Key: SPARK-10815
> URL: https://issues.apache.org/jira/browse/SPARK-10815
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Priority: Critical
>
> The existing (in 2.0) source/sink interface for structured streaming depends 
> on RDDs. This dependency has two issues:
> 1. The RDD interface is wide and difficult to stabilize across versions. This 
> is similar to point 1 in https://issues.apache.org/jira/browse/SPARK-15689. 
> Ideally, a source/sink implementation created for Spark 2.x should work in 
> Spark 10.x, assuming the JVM is still around.
> 2. It is difficult to swap in/out a different execution engine.
> The purpose of this ticket is to create a stable interface that addresses the 
> above two.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars

2018-03-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23649:


Assignee: Apache Spark

> CSV schema inferring fails on some UTF-8 chars
> --
>
> Key: SPARK-23649
> URL: https://issues.apache.org/jira/browse/SPARK-23649
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
> Attachments: utf8xFF.csv
>
>
> Schema inference for CSV files fails if the file contains a character whose 
> first byte is *0xFF*. 
> {code:java}
> spark.read.option("header", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 63
>   at 
> org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
>   at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
> {code}
> Here is content of the file:
> {code:java}
> hexdump -C ~/tmp/utf8xFF.csv
>   63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
> 0010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
> 0020  2c 34 35 36 0d|,456.|
> 0025
> {code}
> Schema inferring doesn't fail in multiline mode:
> {code}
> spark.read.option("header", "true").option("multiline", 
> "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> +---+-+
> |channel|code
> +---+-+
> | United| 123
> | ABGUN�| 456
> +---+-+
> {code}
> and Spark is able to read the csv file if the schema is specified:
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType().add("channel", StringType).add("code", 
> StringType)
> spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
> {code}
> {code:java}
> +---++
> |channel|code|
> +---++
> | United| 123|
> | ABGUN�| 456|
> +---++
> {code}
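As a side note (my own illustration, not from the report): 0xFF can never occur as 
a lead byte in valid UTF-8, which is presumably why the per-byte length lookup in 
UTF8String goes out of bounds. Python's strict decoder rejects the same byte:
{code:java}
b"ABGUN\xff".decode("utf-8")
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 5:
# invalid start byte
{code}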



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars

2018-03-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23649:


Assignee: (was: Apache Spark)

> CSV schema inferring fails on some UTF-8 chars
> --
>
> Key: SPARK-23649
> URL: https://issues.apache.org/jira/browse/SPARK-23649
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf8xFF.csv
>
>
> Schema inference for CSV files fails if the file contains a character whose 
> first byte is *0xFF*. 
> {code:java}
> spark.read.option("header", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 63
>   at 
> org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
>   at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
> {code}
> Here is content of the file:
> {code:java}
> hexdump -C ~/tmp/utf8xFF.csv
>   63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
> 0010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
> 0020  2c 34 35 36 0d|,456.|
> 0025
> {code}
> Schema inferring doesn't fail in multiline mode:
> {code}
> spark.read.option("header", "true").option("multiline", 
> "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> +---+-+
> |channel|code
> +---+-+
> | United| 123
> | ABGUN�| 456
> +---+-+
> {code}
> and Spark is able to read the csv file if the schema is specified:
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType().add("channel", StringType).add("code", 
> StringType)
> spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
> {code}
> {code:java}
> +---++
> |channel|code|
> +---++
> | United| 123|
> | ABGUN�| 456|
> +---++
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars

2018-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394582#comment-16394582
 ] 

Apache Spark commented on SPARK-23649:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20796

> CSV schema inferring fails on some UTF-8 chars
> --
>
> Key: SPARK-23649
> URL: https://issues.apache.org/jira/browse/SPARK-23649
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf8xFF.csv
>
>
> Schema inference for CSV files fails if the file contains a character whose 
> first byte is *0xFF*. 
> {code:java}
> spark.read.option("header", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 63
>   at 
> org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
>   at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
> {code}
> Here is content of the file:
> {code:java}
> hexdump -C ~/tmp/utf8xFF.csv
>   63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
> 0010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
> 0020  2c 34 35 36 0d|,456.|
> 0025
> {code}
> Schema inferring doesn't fail in multiline mode:
> {code}
> spark.read.option("header", "true").option("multiline", 
> "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> +---+-+
> |channel|code
> +---+-+
> | United| 123
> | ABGUN�| 456
> +---+-+
> {code}
> and Spark is able to read the csv file if the schema is specified:
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType().add("channel", StringType).add("code", 
> StringType)
> spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
> {code}
> {code:java}
> +---++
> |channel|code|
> +---++
> | United| 123|
> | ABGUN�| 456|
> +---++
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16859) History Server storage information is missing

2018-03-11 Thread Richard Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394580#comment-16394580
 ] 

Richard Yu edited comment on SPARK-16859 at 3/11/18 5:26 PM:
-

Is this issue not already resolved?

When looking into {{JsonProtocol}} , I found that {{SparkListenerBlockUpdated}} 
was already included as a case in Serialization.

This was in spark-core_2.11.

 


was (Author: yohan123):
Is this issue not already resolved?

When looking into {{JsonProtocol}} , I found that {{SparkListenerBlockUpdated}} 
was already included as a case in Serialization.

This was in Scala-2.11.

 

> History Server storage information is missing
> -
>
> Key: SPARK-16859
> URL: https://issues.apache.org/jira/browse/SPARK-16859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Andrei Ivanov
>Priority: Major
>  Labels: historyserver, newbie
>
> It looks like the job history storage tab in the history server has been broken 
> for completed jobs since *1.6.2*. 
> More specifically, it's been broken since 
> [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].
> I've fixed it for my installation by effectively reverting the above patch 
> ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).
> IMHO, the most straightforward fix would be to implement 
> _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_, making 
> sure it works from _ReplayListenerBus_.
> The downside is that it will still work incorrectly with pre-patch job 
> histories. But then, it hasn't worked since *1.6.2* anyhow.
> PS: I'd really love to have this fixed eventually. But I'm pretty new to 
> Apache Spark and lack hands-on Scala experience. So I'd prefer that it be 
> fixed by someone experienced with roadmap vision. If nobody volunteers I'll 
> try to patch it myself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars

2018-03-11 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-23649:
---
Shepherd: Herman van Hovell

> CSV schema inferring fails on some UTF-8 chars
> --
>
> Key: SPARK-23649
> URL: https://issues.apache.org/jira/browse/SPARK-23649
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf8xFF.csv
>
>
> Schema inference for CSV files fails if the file contains a character whose 
> first byte is *0xFF*. 
> {code:java}
> spark.read.option("header", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 63
>   at 
> org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
>   at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
> {code}
> Here is content of the file:
> {code:java}
> hexdump -C ~/tmp/utf8xFF.csv
>   63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
> 0010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
> 0020  2c 34 35 36 0d|,456.|
> 0025
> {code}
> Schema inferring doesn't fail in multiline mode:
> {code}
> spark.read.option("header", "true").option("multiline", 
> "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> +---+-+
> |channel|code
> +---+-+
> | United| 123
> | ABGUN�| 456
> +---+-+
> {code}
> and Spark is able to read the csv file if the schema is specified:
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType().add("channel", StringType).add("code", 
> StringType)
> spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
> {code}
> {code:java}
> +---++
> |channel|code|
> +---++
> | United| 123|
> | ABGUN�| 456|
> +---++
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16859) History Server storage information is missing

2018-03-11 Thread Richard Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394580#comment-16394580
 ] 

Richard Yu commented on SPARK-16859:


Is this issue not already resolved?

When looking into {{JsonProtocol}} , I found that {{SparkListenerBlockUpdated}} 
was already included as a case in Serialization.

 

> History Server storage information is missing
> -
>
> Key: SPARK-16859
> URL: https://issues.apache.org/jira/browse/SPARK-16859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Andrei Ivanov
>Priority: Major
>  Labels: historyserver, newbie
>
> It looks like the job history storage tab in the history server has been broken 
> for completed jobs since *1.6.2*. 
> More specifically, it's been broken since 
> [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].
> I've fixed it for my installation by effectively reverting the above patch 
> ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).
> IMHO, the most straightforward fix would be to implement 
> _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_, making 
> sure it works from _ReplayListenerBus_.
> The downside is that it will still work incorrectly with pre-patch job 
> histories. But then, it hasn't worked since *1.6.2* anyhow.
> PS: I'd really love to have this fixed eventually. But I'm pretty new to 
> Apache Spark and lack hands-on Scala experience. So I'd prefer that it be 
> fixed by someone experienced with roadmap vision. If nobody volunteers I'll 
> try to patch it myself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16859) History Server storage information is missing

2018-03-11 Thread Richard Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394580#comment-16394580
 ] 

Richard Yu edited comment on SPARK-16859 at 3/11/18 5:25 PM:
-

Is this issue not already resolved?

When looking into {{JsonProtocol}} , I found that {{SparkListenerBlockUpdated}} 
was already included as a case in Serialization.

This was in Scala-2.11.

 


was (Author: yohan123):
Is this issue not already resolved?

When looking into {{JsonProtocol}} , I found that {{SparkListenerBlockUpdated}} 
was already included as a case in Serialization.

 

> History Server storage information is missing
> -
>
> Key: SPARK-16859
> URL: https://issues.apache.org/jira/browse/SPARK-16859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Andrei Ivanov
>Priority: Major
>  Labels: historyserver, newbie
>
> It looks like the job history storage tab in the history server has been broken 
> for completed jobs since *1.6.2*. 
> More specifically, it's been broken since 
> [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].
> I've fixed it for my installation by effectively reverting the above patch 
> ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).
> IMHO, the most straightforward fix would be to implement 
> _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_, making 
> sure it works from _ReplayListenerBus_.
> The downside is that it will still work incorrectly with pre-patch job 
> histories. But then, it hasn't worked since *1.6.2* anyhow.
> PS: I'd really love to have this fixed eventually. But I'm pretty new to 
> Apache Spark and lack hands-on Scala experience. So I'd prefer that it be 
> fixed by someone experienced with roadmap vision. If nobody volunteers I'll 
> try to patch it myself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars

2018-03-11 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-23649:
---
Attachment: utf8xFF.csv

> CSV schema inferring fails on some UTF-8 chars
> --
>
> Key: SPARK-23649
> URL: https://issues.apache.org/jira/browse/SPARK-23649
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf8xFF.csv
>
>
> Schema inference for CSV files fails if the file contains a character whose 
> first byte is *0xFF*. 
> {code:java}
> spark.read.option("header", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 63
>   at 
> org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
>   at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
> {code}
> Here is content of the file:
> {code:java}
> hexdump -C ~/tmp/utf8xFF.csv
>   63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
> 0010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
> 0020  2c 34 35 36 0d|,456.|
> 0025
> {code}
> Schema inferring doesn't fail in multiline mode:
> {code}
> spark.read.option("header", "true").option("multiline", 
> "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> +---+-+
> |channel|code
> +---+-+
> | United| 123
> | ABGUN�| 456
> +---+-+
> {code}
> and Spark is able to read the csv file if the schema is specified:
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType().add("channel", StringType).add("code", 
> StringType)
> spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
> {code}
> {code:java}
> +---++
> |channel|code|
> +---++
> | United| 123|
> | ABGUN�| 456|
> +---++
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars

2018-03-11 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-23649:
--

 Summary: CSV schema inferring fails on some UTF-8 chars
 Key: SPARK-23649
 URL: https://issues.apache.org/jira/browse/SPARK-23649
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk


Schema inferring of CSV files fails if the file contains a char starting with 
the byte *0xFF*.
{code:java}
spark.read.option("header", "true").csv("utf8xFF.csv")
{code}
{code:java}
java.lang.ArrayIndexOutOfBoundsException: 63
  at 
org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
  at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
{code}
Here is the content of the file:
{code:java}
hexdump -C ~/tmp/utf8xFF.csv
  63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
0010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
0020  2c 34 35 36 0d|,456.|
0025
{code}
Schema inferring doesn't fail in multiline mode:
{code}
spark.read.option("header", "true").option("multiline", 
"true").csv("utf8xFF.csv")
{code}
{code:java}
+-------+----+
|channel|code|
+-------+----+
| United| 123|
| ABGUN�| 456|
+-------+----+
{code}
and Spark is able to read the csv file if the schema is specified:
{code}
import org.apache.spark.sql.types._
val schema = new StructType().add("channel", StringType).add("code", StringType)
spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
{code}
{code:java}
+-------+----+
|channel|code|
+-------+----+
| United| 123|
| ABGUN�| 456|
+-------+----+
{code}
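
A minimal standalone reproduction of the underlying lookup (a sketch against the 
UTF8String class named in the stack trace; the byte array simply mimics the 
"ABGUN" plus 0xFF field from the file):
{code}
import org.apache.spark.unsafe.types.UTF8String

// 0xFF is not a valid UTF-8 leading byte, so the per-byte length lookup used by
// numChars()/numBytesForFirstByte() indexes past the end of its table.
val bytes = Array[Byte](0x41, 0x42, 0x47, 0x55, 0x4e, 0xFF.toByte) // "ABGUN" + 0xFF
val s = UTF8String.fromBytes(bytes)
s.numChars() // throws ArrayIndexOutOfBoundsException: 63 in the affected versions
{code}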



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18134) SQL: MapType in Group BY and Joins not working

2018-03-11 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394571#comment-16394571
 ] 

Sital Kedia commented on SPARK-18134:
-

[~hvanhovell] - What is the state of this JIRA? Do we expect to make progress on 
this any time soon? This is a very critical feature we are missing in Spark, and 
it is blocking us from auto-migrating HiveQL to Spark. Please let us know if 
there is anything we can do to help resolve this issue.

 

cc - [~sameerag], [~tejasp]

> SQL: MapType in Group BY and Joins not working
> --
>
> Key: SPARK-18134
> URL: https://issues.apache.org/jira/browse/SPARK-18134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1, 
> 2.1.0
>Reporter: Christian Zorneck
>Priority: Major
>
> Since version 1.5 and issue SPARK-9415, MapTypes can no longer be used in 
> GROUP BY and join clauses. This makes it incompatible with HiveQL; in effect, a 
> Hive feature was removed from Spark, making Spark incompatible with various 
> HiveQL statements.
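> A minimal sketch (hypothetical data and column names) of the pattern that is 
> rejected since that change:
> {code}
> val df = spark.createDataFrame(Seq((Map("k" -> 1), 10), (Map("k" -> 1), 20)))
>   .toDF("m", "v")
> df.createOrReplaceTempView("t")
> // Fails analysis: map types are not orderable, so they are rejected in GROUP BY
> spark.sql("SELECT m, sum(v) FROM t GROUP BY m")
> {code}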



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2018-03-11 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394550#comment-16394550
 ] 

kevin yu commented on SPARK-19737:
--

[~LANDAIS Christophe], I submitted a PR under SPARK-23486; can you try it and see 
if it helps?

 

> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Major
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an undefined 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned temporary view consisting of a large 
> number of files stored on S3, it may take the analyzer a long time before 
> realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Looks up the function names in the function registry
> # Reports an analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{ResolveRelations}}.
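> A minimal sketch (not the actual implementation) of what such a rule could look 
> like, assuming access to the session's _SessionCatalog_:
> {code}
> import org.apache.spark.sql.catalyst.analysis.{NoSuchFunctionException, UnresolvedFunction}
> import org.apache.spark.sql.catalyst.catalog.SessionCatalog
> import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
> import org.apache.spark.sql.catalyst.rules.Rule
> 
> // Scans for unresolved function invocations and fails fast when the name is not
> // registered; children are never resolved, so no relation or partition discovery runs.
> case class LookupFunctions(catalog: SessionCatalog) extends Rule[LogicalPlan] {
>   override def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
>     case f: UnresolvedFunction if !catalog.functionExists(f.name) =>
>       throw new NoSuchFunctionException(
>         f.name.database.getOrElse(catalog.getCurrentDatabase), f.name.funcName)
>   }
> }
> {code}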



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23647) extend hint syntax to support any expression for Python

2018-03-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23647:


Assignee: (was: Apache Spark)

> extend hint syntax to support any expression for Python
> ---
>
> Key: SPARK-23647
> URL: https://issues.apache.org/jira/browse/SPARK-23647
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Dylan Guedes
>Priority: Major
>
> Relax checks in
> [https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23647) extend hint syntax to support any expression for Python

2018-03-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23647:


Assignee: Apache Spark

> extend hint syntax to support any expression for Python
> ---
>
> Key: SPARK-23647
> URL: https://issues.apache.org/jira/browse/SPARK-23647
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Dylan Guedes
>Assignee: Apache Spark
>Priority: Major
>
> Relax checks in
> [https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23647) extend hint syntax to support any expression for Python

2018-03-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394499#comment-16394499
 ] 

Apache Spark commented on SPARK-23647:
--

User 'DylanGuedes' has created a pull request for this issue:
https://github.com/apache/spark/pull/20788

> extend hint syntax to support any expression for Python
> ---
>
> Key: SPARK-23647
> URL: https://issues.apache.org/jira/browse/SPARK-23647
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Dylan Guedes
>Priority: Major
>
> Relax checks in
> [https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23648) extend hint syntax to support any expression for R

2018-03-11 Thread Dylan Guedes (JIRA)
Dylan Guedes created SPARK-23648:


 Summary: extend hint syntax to support any expression for R
 Key: SPARK-23648
 URL: https://issues.apache.org/jira/browse/SPARK-23648
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR, SQL
Affects Versions: 2.3.0, 2.2.0
Reporter: Dylan Guedes


Relax checks in
[https://github.com/apache/spark/blob/7f203a248f94df6183a4bc4642a3d873171fef29/R/pkg/R/DataFrame.R#L3746]

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23647) extend hint syntax to support any expression for Python

2018-03-11 Thread Dylan Guedes (JIRA)
Dylan Guedes created SPARK-23647:


 Summary: extend hint syntax to support any expression for Python
 Key: SPARK-23647
 URL: https://issues.apache.org/jira/browse/SPARK-23647
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Affects Versions: 2.2.0
Reporter: Dylan Guedes


Relax checks in

[https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422]

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23646) pyspark DataFrameWriter ignores customized settings?

2018-03-11 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23646.
--
Resolution: Invalid

Let me leave this resolved but please feel free to reopen when it becomes clear 
that it's an issue.

> pyspark DataFrameWriter ignores customized settings?
> 
>
> Key: SPARK-23646
> URL: https://issues.apache.org/jira/browse/SPARK-23646
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Chuan-Heng Hsiao
>Priority: Major
>
> I am using spark-2.2.1-bin-hadoop2.7 with stand-alone mode.
> (python version: 3.5.2 from ubuntu 16.04)
> I intended to have a DataFrame write to HDFS with a customized block size but 
> failed.
> However, the corresponding RDD can successfully write with the customized 
> block size.
>  
>  
> The following is the test code:
> (dfs.namenode.fs-limits.min-block-size has been set to 131072 in HDFS)
>  
>  
> ##
> # init
> ##
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
>  
> import hdfs
> from hdfs import InsecureClient
> import os
>  
> import numpy as np
> import pandas as pd
> import logging
>  
> os.environ['SPARK_HOME'] = '/opt/spark-2.2.1-bin-hadoop2.7'
>  
> block_size = 512 * 1024
>  
> conf = SparkConf().setAppName("DCSSpark").setMaster("spark://spark1:7077") \
>     .set('spark.cores.max', 20).set("spark.executor.cores", 10) \
>     .set("spark.executor.memory", "10g") \
>     .set("spark.hadoop.dfs.blocksize", str(block_size)) \
>     .set("spark.hadoop.dfs.block.size", str(block_size))
>  
> spark = SparkSession.builder.config(conf=conf).getOrCreate()
> spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.blocksize", block_size)
> spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.block.size", block_size)
>  
> ##
> # main
> ##
> # create DataFrame
> df_txt = spark.createDataFrame([{'temp': "hello"}, {'temp': "world"}, {'temp': "!"}])
>  
> # save using DataFrameWriter, resulting in a 128MB block size
> df_txt.write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_with_df')
>  
> # save using rdd, resulting in a 512KB block size
> client = InsecureClient('http://spark1:50070')
> client.delete('/tmp/temp_with_rrd', recursive=True)
> df_txt.rdd.saveAsTextFile('hdfs://spark1/tmp/temp_with_rrd')



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org