[jira] [Resolved] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None

2019-05-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27612.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24519
[https://github.com/apache/spark/pull/24519]

> Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays 
> of None
> -
>
> Key: SPARK-27612
> URL: https://issues.apache.org/jira/browse/SPARK-27612
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Assignee: Hyukjin Kwon
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.0
>
>
> This seems to only affect Python 3.
> When creating a DataFrame with type {{ArrayType(IntegerType(), True)}}, some 
> rows end up filled with None.
>  
> {code:java}
> In [1]: from pyspark.sql.types import ArrayType, IntegerType  
>    
> In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, 
> ArrayType(IntegerType(), True))     
> In [3]: df.distinct().collect()   
>    
> Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
> {code}
>  
> From this example, it is consistently at elements 97, 98:
> {code}
> In [5]: df.collect()[-5:] 
>    
> Out[5]: 
> [Row(value=[1, 2, 3, 4]),
>  Row(value=[1, 2, 3, 4]),
>  Row(value=[None, None, None, None]),
>  Row(value=[None, None, None, None]),
>  Row(value=[1, 2, 3, 4])]
> {code}
> This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}}






[jira] [Assigned] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None

2019-05-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27612:


Assignee: Hyukjin Kwon

> Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays 
> of None
> -
>
> Key: SPARK-27612
> URL: https://issues.apache.org/jira/browse/SPARK-27612
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Assignee: Hyukjin Kwon
>Priority: Blocker
>  Labels: correctness
>
> This seems to only affect Python 3.
> When creating a DataFrame with type {{ArrayType(IntegerType(), True)}}, some 
> rows end up filled with None.
>  
> {code:java}
> In [1]: from pyspark.sql.types import ArrayType, IntegerType  
>    
> In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, 
> ArrayType(IntegerType(), True))     
> In [3]: df.distinct().collect()   
>    
> Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
> {code}
>  
> From this example, it is consistently at elements 97, 98:
> {code}
> In [5]: df.collect()[-5:] 
>    
> Out[5]: 
> [Row(value=[1, 2, 3, 4]),
>  Row(value=[1, 2, 3, 4]),
>  Row(value=[None, None, None, None]),
>  Row(value=[None, None, None, None]),
>  Row(value=[1, 2, 3, 4])]
> {code}
> This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}}






[jira] [Updated] (SPARK-27586) Improve binary comparison: replace Scala's for-comprehension if statements with while loop

2019-05-02 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27586:
--
Affects Version/s: (was: 2.4.2)
   3.0.0

> Improve binary comparison: replace Scala's for-comprehension if statements 
> with while loop
> --
>
> Key: SPARK-27586
> URL: https://issues.apache.org/jira/browse/SPARK-27586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: benchmark env:
>  * Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
>  * Linux 4.4.0-33.bm.1-amd64
>  * java version "1.8.0_131"
>  * Scala 2.11.8
>  * perf version 4.4.0
> Run:
> 40,000,000 times comparison on 32 bytes-length binary
>  
>Reporter: WoudyGao
>Assignee: WoudyGao
>Priority: Minor
> Fix For: 3.0.0
>
>
> I found the CPU cost of TypeUtils.compareBinary is noticeable when handling 
> some big parquet files.
> After some perf work, I found that the "for-comprehension if statements" 
> execute ≈15x more instructions than the while loop (a minimal sketch of the 
> two styles follows this quoted description).
>  
> *'while-loop' version perf:*
>   
>  {{886.687949  task-clock (msec) #    1.257 CPUs 
> utilized}}
>  {{ 3,089  context-switches  #    0.003 M/sec}}
>  {{   265  cpu-migrations#    0.299 K/sec}}
>  {{12,227  page-faults   #    0.014 M/sec}}
>  {{ 2,209,183,920  cycles#    2.492 GHz}}
>  {{     stalled-cycles-frontend}}
>  {{     stalled-cycles-backend}}
>  {{ 6,865,836,114  instructions  #    3.11  insns per 
> cycle}}
>  {{ 1,568,910,228  branches  # 1769.405 M/sec}}
>  {{ 9,172,613  branch-misses #    0.58% of all 
> branches}}
>   
>  {{   0.705671157 seconds time elapsed}}
>   
> *TypeUtils.compareBinary perf:*
>  {{  16347.242313  task-clock (msec) #    1.233 CPUs 
> utilized}}
>  {{ 8,370  context-switches  #    0.512 K/sec}}
>  {{   481  cpu-migrations#    0.029 K/sec}}
>  {{   536,671  page-faults   #    0.033 M/sec}}
>  {{40,857,347,119  cycles#    2.499 GHz}}
>  {{     stalled-cycles-frontend}}
>  {{     stalled-cycles-backend}}
>  {{90,606,381,612  instructions  #    2.22  insns per 
> cycle}}
>  {{18,107,867,151  branches  # 1107.702 M/sec}}
>  {{12,880,296  branch-misses #    0.07% of all 
> branches}}
>   
>  {{  13.257617118 seconds time elapsed}}
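For illustration, here is a minimal Scala sketch of the two comparison styles the 
ticket contrasts. This is not the actual TypeUtils.compareBinary code; both 
functions simply compare two byte arrays lexicographically, treating bytes as 
unsigned.

{code}
// Hedged sketch only. The for-comprehension allocates a Range and a closure
// per call, which is the overhead the ticket attributes to the old style.
def compareFor(x: Array[Byte], y: Array[Byte]): Int = {
  for (i <- 0 until math.min(x.length, y.length) if x(i) != y(i)) {
    return (x(i) & 0xff) - (y(i) & 0xff)  // first differing byte decides
  }
  x.length - y.length
}

// The while-loop version does the same work with plain index arithmetic.
def compareWhile(x: Array[Byte], y: Array[Byte]): Int = {
  val len = math.min(x.length, y.length)
  var i = 0
  while (i < len) {
    val diff = (x(i) & 0xff) - (y(i) & 0xff)
    if (diff != 0) return diff
    i += 1
  }
  x.length - y.length
}
{code}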






[jira] [Resolved] (SPARK-27586) Improve binary comparison: replace Scala's for-comprehension if statements with while loop

2019-05-02 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27586.
---
   Resolution: Fixed
 Assignee: WoudyGao
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24494

> Improve binary comparison: replace Scala's for-comprehension if statements 
> with while loop
> --
>
> Key: SPARK-27586
> URL: https://issues.apache.org/jira/browse/SPARK-27586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.2
> Environment: benchmark env:
>  * Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
>  * Linux 4.4.0-33.bm.1-amd64
>  * java version "1.8.0_131"
>  * Scala 2.11.8
>  * perf version 4.4.0
> Run:
> 40,000,000 times comparison on 32 bytes-length binary
>  
>Reporter: WoudyGao
>Assignee: WoudyGao
>Priority: Minor
> Fix For: 3.0.0
>
>
> I found the CPU cost of TypeUtils.compareBinary is noticeable when handling 
> some big parquet files.
> After some perf work, I found that the "for-comprehension if statements" 
> execute ≈15x more instructions than the while loop.
>  
> *'while-loop' version perf:*
>   
>  {{886.687949  task-clock (msec) #    1.257 CPUs 
> utilized}}
>  {{ 3,089  context-switches  #    0.003 M/sec}}
>  {{   265  cpu-migrations#    0.299 K/sec}}
>  {{12,227  page-faults   #    0.014 M/sec}}
>  {{ 2,209,183,920  cycles#    2.492 GHz}}
>  {{     stalled-cycles-frontend}}
>  {{     stalled-cycles-backend}}
>  {{ 6,865,836,114  instructions  #    3.11  insns per 
> cycle}}
>  {{ 1,568,910,228  branches  # 1769.405 M/sec}}
>  {{ 9,172,613  branch-misses #    0.58% of all 
> branches}}
>   
>  {{   0.705671157 seconds time elapsed}}
>   
> *TypeUtils.compareBinary perf:*
>  {{  16347.242313  task-clock (msec) #    1.233 CPUs 
> utilized}}
>  {{ 8,370  context-switches  #    0.512 K/sec}}
>  {{   481  cpu-migrations#    0.029 K/sec}}
>  {{   536,671  page-faults   #    0.033 M/sec}}
>  {{40,857,347,119  cycles#    2.499 GHz}}
>  {{     stalled-cycles-frontend}}
>  {{     stalled-cycles-backend}}
>  {{90,606,381,612  instructions  #    2.22  insns per 
> cycle}}
>  {{18,107,867,151  branches  # 1107.702 M/sec}}
>  {{12,880,296  branch-misses #    0.07% of all 
> branches}}
>   
>  {{  13.257617118 seconds time elapsed}}






[jira] [Assigned] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None

2019-05-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27612:


Assignee: Apache Spark

> Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays 
> of None
> -
>
> Key: SPARK-27612
> URL: https://issues.apache.org/jira/browse/SPARK-27612
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Blocker
>  Labels: correctness
>
> This seems to only affect Python 3.
> When creating a DataFrame with type {{ArrayType(IntegerType(), True)}}, some 
> rows end up filled with None.
>  
> {code:java}
> In [1]: from pyspark.sql.types import ArrayType, IntegerType  
>    
> In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, 
> ArrayType(IntegerType(), True))     
> In [3]: df.distinct().collect()   
>    
> Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
> {code}
>  
> From this example, it is consistently at elements 97, 98:
> {code}
> In [5]: df.collect()[-5:] 
>    
> Out[5]: 
> [Row(value=[1, 2, 3, 4]),
>  Row(value=[1, 2, 3, 4]),
>  Row(value=[None, None, None, None]),
>  Row(value=[None, None, None, None]),
>  Row(value=[1, 2, 3, 4])]
> {code}
> This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}}






[jira] [Resolved] (SPARK-27467) Upgrade Maven to 3.6.1

2019-05-02 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27467.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24481

> Upgrade Maven to 3.6.1
> --
>
> Key: SPARK-27467
> URL: https://issues.apache.org/jira/browse/SPARK-27467
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> This issue aims to upgrade Maven to 3.6.1 to bring in JDK9+ patches like 
> MNG-6506. For the full release notes, please see the following.
> https://maven.apache.org/docs/3.6.1/release-notes.html






[jira] [Assigned] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None

2019-05-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27612:


Assignee: (was: Apache Spark)

> Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays 
> of None
> -
>
> Key: SPARK-27612
> URL: https://issues.apache.org/jira/browse/SPARK-27612
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Blocker
>  Labels: correctness
>
> This seems to only affect Python 3.
> When creating a DataFrame with type {{ArrayType(IntegerType(), True)}}, some 
> rows end up filled with None.
>  
> {code:java}
> In [1]: from pyspark.sql.types import ArrayType, IntegerType  
>    
> In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, 
> ArrayType(IntegerType(), True))     
> In [3]: df.distinct().collect()   
>    
> Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
> {code}
>  
> From this example, it is consistently at elements 97, 98:
> {code}
> In [5]: df.collect()[-5:] 
>    
> Out[5]: 
> [Row(value=[1, 2, 3, 4]),
>  Row(value=[1, 2, 3, 4]),
>  Row(value=[None, None, None, None]),
>  Row(value=[None, None, None, None]),
>  Row(value=[1, 2, 3, 4])]
> {code}
> This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}}






[jira] [Updated] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None

2019-05-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27612:
-
Priority: Blocker  (was: Critical)

> Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays 
> of None
> -
>
> Key: SPARK-27612
> URL: https://issues.apache.org/jira/browse/SPARK-27612
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Blocker
>  Labels: correctness
>
> This seems to only affect Python 3.
> When creating a DataFrame with type {{ArrayType(IntegerType(), True)}}, some 
> rows end up filled with None.
>  
> {code:java}
> In [1]: from pyspark.sql.types import ArrayType, IntegerType  
>    
> In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, 
> ArrayType(IntegerType(), True))     
> In [3]: df.distinct().collect()   
>    
> Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
> {code}
>  
> From this example, it is consistently at elements 97, 98:
> {code}
> In [5]: df.collect()[-5:] 
>    
> Out[5]: 
> [Row(value=[1, 2, 3, 4]),
>  Row(value=[1, 2, 3, 4]),
>  Row(value=[None, None, None, None]),
>  Row(value=[None, None, None, None]),
>  Row(value=[1, 2, 3, 4])]
> {code}
> This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}}






[jira] [Commented] (SPARK-27623) Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated

2019-05-02 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832140#comment-16832140
 ] 

Yuming Wang commented on SPARK-27623:
-

[~abarbulescu] We have had a built-in Avro data source implementation since 
SPARK-24768 (2.4.0). Could you try removing {{--jars Spark-Avro 2.4.0}} and 
loading Avro with {{spark.read.format("avro").load("/path/to/avro")}}?
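For reference, a hedged sketch of what that looks like from spark-shell; the 
--packages coordinates and the input path below are assumptions, not taken from 
this thread.

{code}
// Hedged sketch for spark-shell. The built-in module is pulled in with
// --packages (coordinates assumed for a Scala 2.11 build of Spark 2.4.x):
//   ./bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.2
val df = spark.read.format("avro").load("/path/to/avro")  // placeholder path
df.printSchema()
{code}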

> Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
> ---
>
> Key: SPARK-27623
> URL: https://issues.apache.org/jira/browse/SPARK-27623
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.2
>Reporter: Alexandru Barbulescu
>Priority: Critical
>
> After updating to Spark 2.4.2, when using the 
> {code:java}
> spark.read.format().options().load()
> {code}
>  
> chain of methods, regardless of what parameter is passed to "format", we get 
> the following error related to Avro:
>  
> {code:java}
> - .options(**load_options)
> - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 
> 172, in load
> - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
> 1257, in __call__
> - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in 
> deco
> - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> - py4j.protocol.Py4JJavaError: An error occurred while calling o69.load.
> - : java.util.ServiceConfigurationError: 
> org.apache.spark.sql.sources.DataSourceRegister: Provider 
> org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
> - at java.util.ServiceLoader.fail(ServiceLoader.java:232)
> - at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
> - at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
> - at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
> - at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
> - at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
> - at scala.collection.Iterator.foreach(Iterator.scala:941)
> - at scala.collection.Iterator.foreach$(Iterator.scala:941)
> - at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> - at scala.collection.IterableLike.foreach(IterableLike.scala:74)
> - at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> - at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> - at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:250)
> - at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:248)
> - at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
> - at scala.collection.TraversableLike.filter(TraversableLike.scala:262)
> - at scala.collection.TraversableLike.filter$(TraversableLike.scala:262)
> - at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
> - at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
> - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
> - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
> - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> - at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> - at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> - at java.lang.reflect.Method.invoke(Method.java:498)
> - at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> - at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> - at py4j.Gateway.invoke(Gateway.java:282)
> - at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> - at py4j.commands.CallCommand.execute(CallCommand.java:79)
> - at py4j.GatewayConnection.run(GatewayConnection.java:238)
> - at java.lang.Thread.run(Thread.java:748)
> - Caused by: java.lang.NoClassDefFoundError: 
> org/apache/spark/sql/execution/datasources/FileFormat$class
> - at org.apache.spark.sql.avro.AvroFileFormat.(AvroFileFormat.scala:44)
> - at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> - at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> - at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> - at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> - at java.lang.Class.newInstance(Class.java:442)
> - at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
> - ... 29 more
> - Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.sql.execution.datasources.FileFormat$class
> - at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> - at 

[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support

2019-05-02 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832136#comment-16832136
 ] 

Bryan Cutler commented on SPARK-27396:
--

The revisions sound good to me; the proposal has a little more focus now, and I 
agree it's better to handle exporting columnar data in the other SPIP.

I have a question on the nice-to-have point of exposing data in Arrow format. 
Do you mean that the Arrow Java APIs will not be publicly exposed, only access 
to the raw data, so the user can process the data with Arrow without further 
conversion? If that is not done, would the user have to do some minor 
conversions from Spark's internal format to Arrow?

Also, how are the batch sizes set? Is it basically one batch per partition, or 
can it be configured to break a partition up into smaller batches?
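For context, a minimal sketch of the kind of access the existing public 
vectorized API already gives without exposing the underlying buffers. This is 
illustrative only and assumes the org.apache.spark.sql.vectorized classes; it is 
not part of the SPIP.

{code}
import org.apache.spark.sql.vectorized.{ColumnarBatch, ColumnVector}

// Illustrative only: sum an integer column of a batch through the public
// per-value accessors, skipping nulls.
def sumFirstIntColumn(batch: ColumnarBatch): Long = {
  val col: ColumnVector = batch.column(0)
  var sum = 0L
  var i = 0
  while (i < batch.numRows()) {
    if (!col.isNullAt(i)) sum += col.getInt(i)
    i += 1
  }
  sum
}
{code}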

> SPIP: Public APIs for extended Columnar Processing Support
> --
>
> Key: SPARK-27396
> URL: https://issues.apache.org/jira/browse/SPARK-27396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> *SPIP: Columnar Processing Without Arrow Formatting Guarantees.*
>  
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
> The Dataset/DataFrame API in Spark currently only exposes to users one row at 
> a time when processing data.  The goals of this are to
>  # Add to the current sql extensions mechanism so advanced users can have 
> access to the physical SparkPlan and manipulate it to provide columnar 
> processing for existing operators, including shuffle.  This will allow them 
> to implement their own cost based optimizers to decide when processing should 
> be columnar and when it should not.
>  # Make any transitions between the columnar memory layout and a row based 
> layout transparent to the users so operations that are not columnar see the 
> data as rows, and operations that are columnar see the data as columns.
>  
> Not Requirements, but things that would be nice to have.
>  # Transition the existing in memory columnar layouts to be compatible with 
> Apache Arrow.  This would make the transformations to Apache Arrow format a 
> no-op. The existing formats are already very close to those layouts in many 
> cases.  This would not be using the Apache Arrow java library, but instead 
> being compatible with the memory 
> [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a 
> subset of that layout.
>  
> *Q2.* What problem is this proposal NOT designed to solve? 
> The goal of this is not for ML/AI but to provide APIs for accelerated 
> computing in Spark primarily targeting SQL/ETL like workloads.  ML/AI already 
> have several mechanisms to get data into/out of them. These can be improved 
> but will be covered in a separate SPIP.
> This is not trying to implement any of the processing itself in a columnar 
> way, with the exception of examples for documentation.
> This does not cover exposing the underlying format of the data.  The only way 
> to get at the data in a ColumnVector is through the public APIs.  Exposing 
> the underlying format to improve efficiency will be covered in a separate 
> SPIP.
> This is not trying to implement new ways of transferring data to external 
> ML/AI applications.  That is covered by separate SPIPs already.
> This is not trying to add in generic code generation for columnar processing. 
>  Currently code generation for columnar processing is only supported when 
> translating columns to rows.  We will continue to support this, but will not 
> extend it as a general solution. That will be covered in a separate SPIP if 
> we find it is helpful.  For now columnar processing will be interpreted.
> This is not trying to expose a way to get columnar data into Spark through 
> DataSource V2 or any other similar API.  That would be covered by a separate 
> SPIP if we find it is needed.
>  
> *Q3.* How is it done today, and what are the limits of current practice?
> The current columnar support is limited to 3 areas.
>  # Internal implementations of FileFormats, optionally can return a 
> ColumnarBatch instead of rows.  The code generation phase knows how to take 
> that columnar data and iterate through it as rows for stages that want rows, 
> which currently is almost everything.  The limitations here are mostly 
> implementation specific. The current standard is to abuse Scala’s type 
> erasure to return ColumnarBatches as the elements of an RDD[InternalRow]. The 
> code generation can handle this because it is generating java code, so it 
> bypasses scala’s type checking and just casts the InternalRow to the desired 
> ColumnarBatch.  This makes it difficult for others to implement the same 
> 

[jira] [Resolved] (SPARK-27620) Update jetty to 9.4.18.v20190429

2019-05-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27620.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24513
[https://github.com/apache/spark/pull/24513]

> Update jetty to 9.4.18.v20190429
> 
>
> Key: SPARK-27620
> URL: https://issues.apache.org/jira/browse/SPARK-27620
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: yuming.wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Update jetty to 9.4.18.v20190429 because of 
> [CVE-2019-10247|https://nvd.nist.gov/vuln/detail/CVE-2019-10247].






[jira] [Assigned] (SPARK-27620) Update jetty to 9.4.18.v20190429

2019-05-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27620:


Assignee: yuming.wang

> Update jetty to 9.4.18.v20190429
> 
>
> Key: SPARK-27620
> URL: https://issues.apache.org/jira/browse/SPARK-27620
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: yuming.wang
>Priority: Major
>
> Update jetty to 9.4.18.v20190429 because of 
> [CVE-2019-10247|https://nvd.nist.gov/vuln/detail/CVE-2019-10247].






[jira] [Assigned] (SPARK-27627) Make option "pathGlobFilter" as a general option for all file sources

2019-05-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27627:


Assignee: Apache Spark

> Make option "pathGlobFilter" as a general option for all file sources
> -
>
> Key: SPARK-27627
> URL: https://issues.apache.org/jira/browse/SPARK-27627
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Background:
> The data source option "pathGlobFilter" is introduced for Binary file format: 
> https://github.com/apache/spark/pull/24354 , which can be used for filtering 
> file names, e.g. reading "*.png" files only while there is "*.json" files in 
> the same directory.
> Proposal:
> Make the option "pathGlobFilter" as a general option for all file sources. 
> The path filtering should happen in the path globbing on Driver.
> Motivation:
> Filtering the file path names in file scan tasks on executors is kind of 
> ugly. 
> Impact:
> 1. The splitting of file partitions will be more balanced.
> 2. The metrics of file scan will be more accurate.
> 3. Users can use the option for reading other file sources.






[jira] [Assigned] (SPARK-27627) Make option "pathGlobFilter" as a general option for all file sources

2019-05-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27627:


Assignee: (was: Apache Spark)

> Make option "pathGlobFilter" as a general option for all file sources
> -
>
> Key: SPARK-27627
> URL: https://issues.apache.org/jira/browse/SPARK-27627
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Background:
> The data source option "pathGlobFilter" is introduced for Binary file format: 
> https://github.com/apache/spark/pull/24354 , which can be used for filtering 
> file names, e.g. reading "*.png" files only while there is "*.json" files in 
> the same directory.
> Proposal:
> Make the option "pathGlobFilter" as a general option for all file sources. 
> The path filtering should happen in the path globbing on Driver.
> Motivation:
> Filtering the file path names in file scan tasks on executors is kind of 
> ugly. 
> Impact:
> 1. The splitting of file partitions will be more balanced.
> 2. The metrics of file scan will be more accurate.
> 3. Users can use the option for reading other file sources.






[jira] [Created] (SPARK-27627) Make option "pathGlobFilter" as a general option for all file sources

2019-05-02 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-27627:
--

 Summary: Make option "pathGlobFilter" as a general option for all 
file sources
 Key: SPARK-27627
 URL: https://issues.apache.org/jira/browse/SPARK-27627
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


Background:
The data source option "pathGlobFilter" is introduced for Binary file format: 
https://github.com/apache/spark/pull/24354 , which can be used for filtering 
file names, e.g. reading "*.png" files only while there is "*.json" files in 
the same directory.

Proposal:
Make the option "pathGlobFilter" as a general option for all file sources. The 
path filtering should happen in the path globbing on Driver.

Motivation:
Filtering the file path names in file scan tasks on executors is kind of ugly. 

Impact:
1. The splitting of file partitions will be more balanced.
2. The metrics of file scan will be more accurate.
3. Users can use the option for reading other file sources.
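For example, this is how the option is used with the binary file source today; 
the proposal would make the same option apply to every file source. The 
directory path below is a placeholder.

{code}
// Hedged sketch: read only the *.png files from a directory that also
// contains *.json files.
val pngs = spark.read
  .format("binaryFile")
  .option("pathGlobFilter", "*.png")
  .load("/path/to/mixed-dir")
{code}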






[jira] [Assigned] (SPARK-27626) Fix `docker-image-tool.sh` to be robust in non-bash shell env

2019-05-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27626:


Assignee: Apache Spark

> Fix `docker-image-tool.sh` to be robust in non-bash shell env
> -
>
> Key: SPARK-27626
> URL: https://issues.apache.org/jira/browse/SPARK-27626
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-27626) Fix `docker-image-tool.sh` to be robust in non-bash shell env

2019-05-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27626:


Assignee: (was: Apache Spark)

> Fix `docker-image-tool.sh` to be robust in non-bash shell env
> -
>
> Key: SPARK-27626
> URL: https://issues.apache.org/jira/browse/SPARK-27626
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Created] (SPARK-27626) Fix `docker-image-tool.sh` to be robust in non-bash shell env

2019-05-02 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-27626:
-

 Summary: Fix `docker-image-tool.sh` to be robust in non-bash shell 
env
 Key: SPARK-27626
 URL: https://issues.apache.org/jira/browse/SPARK-27626
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun









[jira] [Updated] (SPARK-27625) ScalaReflection.serializerFor fails for annotated types

2019-05-02 Thread Patrick Grandjean (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Grandjean updated SPARK-27625:
--
Description: 
ScalaReflection.serializerFor fails for annotated types. Example:
{code:java}
case class Foo(
  field1: String,
  field2: Option[String] @Bar
)

val rdd: RDD[Foo] = ...
val ds = rdd.toDS // fails at runtime{code}
The stack trace:
{code:java}
// code placeholder
User class threw exception: scala.MatchError: scala.Option[String] @Bar (of 
class scala.reflect.internal.Types$AnnotatedType)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:483)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445)
at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445)
at ...{code}
I believe that it would be safe to ignore the annotation.

  was:
ScalaReflection.serializerFor fails for annotated types. Example:
{code:java}
case class Foo(
  field1: String,
  field2: Option[String] @Bar
)

val rdd: RDD[Foo] = ...
val ds = rdd.toDS // fails at runtime{code}
The stack trace:
{code:java}
// code placeholder
User class threw exception: scala.MatchError: scala.Option[String] @Bar (of 
class scala.reflect.internal.Types$AnnotatedType)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:483)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445)
at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445)
at ...{code}


> ScalaReflection.serializerFor fails for annotated types
> ---
>
> Key: SPARK-27625
> URL: https://issues.apache.org/jira/browse/SPARK-27625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Patrick Grandjean
>Priority: Major
>
> ScalaReflection.serializerFor fails for annotated types. Example:
> {code:java}
> case class Foo(
>   field1: String,
>   field2: Option[String] @Bar
> )
> val rdd: RDD[Foo] = ...
> val ds = rdd.toDS // fails at runtime{code}
> The stack trace:
> {code:java}
> // code placeholder
> User class threw exception: scala.MatchError: scala.Option[String] @Bar (of 
> class scala.reflect.internal.Types$AnnotatedType)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:483)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445)
> at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445)
> at ...{code}
> I believe that it would be safe to ignore the annotation.
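For illustration, a hedged sketch of the reporter's suggestion, not the actual 
ScalaReflection code: peel off the AnnotatedType wrapper before pattern matching 
on the type, so {{Option[String] @Bar}} is handled like {{Option[String]}}.

{code}
import scala.reflect.runtime.universe._

// Illustrative only: strip annotation wrappers and match on the underlying type.
def stripAnnotations(tpe: Type): Type = tpe match {
  case AnnotatedType(_, underlying) => stripAnnotations(underlying)
  case other => other
}
{code}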






[jira] [Updated] (SPARK-27625) ScalaReflection.serializerFor fails for annotated types

2019-05-02 Thread Patrick Grandjean (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Grandjean updated SPARK-27625:
--
Description: 
ScalaReflection.serializerFor fails for annotated types. Example:
{code:java}
case class Foo(
  field1: String,
  field2: Option[String] @Bar
)

val rdd: RDD[Foo] = ...
val ds = rdd.toDS // fails at runtime{code}
The stack trace:
{code:java}
// code placeholder
User class threw exception: scala.MatchError: scala.Option[String] @Bar (of 
class scala.reflect.internal.Types$AnnotatedType)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:483)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445)
at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445)
at ...{code}

  was:
ScalaReflection.serializerFor fails for annotated types. Example:
{code:java}
case class Foo(
  field1: String,
  field2: Option[String] @Bar
)

val rdd: RDD[Foo] = ...
val ds = rdd.toDS // fails at runtime{code}
The stack trace:
User class threw exception: scala.MatchError: scala.Option[String] @Bar (of 
class scala.reflect.internal.Types$AnnotatedType)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:483)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445)
at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445)
at


> ScalaReflection.serializerFor fails for annotated types
> ---
>
> Key: SPARK-27625
> URL: https://issues.apache.org/jira/browse/SPARK-27625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Patrick Grandjean
>Priority: Major
>
> ScalaReflection.serializerFor fails for annotated types. Example:
> {code:java}
> case class Foo(
>   field1: String,
>   field2: Option[String] @Bar
> )
> val rdd: RDD[Foo] = ...
> val ds = rdd.toDS // fails at runtime{code}
> The stack trace:
> {code:java}
> // code placeholder
> User class threw exception: scala.MatchError: scala.Option[String] @Bar (of 
> class scala.reflect.internal.Types$AnnotatedType)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:483)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445)
> at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445)
> at ...{code}






[jira] [Created] (SPARK-27625) ScalaReflection.serializerFor fails for annotated types

2019-05-02 Thread Patrick Grandjean (JIRA)
Patrick Grandjean created SPARK-27625:
-

 Summary: ScalaReflection.serializerFor fails for annotated types
 Key: SPARK-27625
 URL: https://issues.apache.org/jira/browse/SPARK-27625
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2
Reporter: Patrick Grandjean


ScalaReflection.serializerFor fails for annotated types. Example:
{code:java}
case class Foo(
  field1: String,
  field2: Option[String] @Bar
)

val rdd: RDD[Foo] = ...
val ds = rdd.toDS // fails at runtime{code}
The stack trace:
User class threw exception: scala.MatchError: scala.Option[String] @Bar (of 
class scala.reflect.internal.Types$AnnotatedType)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:483)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445)
at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445)
at






[jira] [Assigned] (SPARK-27601) Upgrade stream-lib to 2.9.6

2019-05-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-27601:
-

Assignee: Yuming Wang
Priority: Minor  (was: Major)

BTW I wouldn't call these "Major", but it doesn't matter much

> Upgrade stream-lib to 2.9.6
> ---
>
> Key: SPARK-27601
> URL: https://issues.apache.org/jira/browse/SPARK-27601
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> 1. Improve HyperLogLogPlus.merge and HyperLogLogPlus.mergeEstimators by using 
> native arrays instead of ArrayLists to avoid boxing and unboxing of integers. 
> https://github.com/addthis/stream-lib/pull/98
> 2. Improve HyperLogLogPlus.sortEncodedSet by using Arrays.sort on 
> appropriately transformed encoded values; in this way, boxing and unboxing of 
> integers is avoided. https://github.com/addthis/stream-lib/pull/97
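As a rough illustration of why that matters, a sketch of the boxing difference 
in generic Scala; this is not stream-lib code.

{code}
import java.util.ArrayList

// Each get() returns a boxed java.lang.Integer that has to be unboxed.
def sumBoxed(xs: ArrayList[Integer]): Long = {
  var sum = 0L
  var i = 0
  while (i < xs.size()) { sum += xs.get(i).intValue(); i += 1 }
  sum
}

// A primitive Array[Int] avoids allocating wrapper objects entirely.
def sumPrimitive(xs: Array[Int]): Long = {
  var sum = 0L
  var i = 0
  while (i < xs.length) { sum += xs(i); i += 1 }
  sum
}
{code}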






[jira] [Resolved] (SPARK-27601) Upgrade stream-lib to 2.9.6

2019-05-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27601.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24492
[https://github.com/apache/spark/pull/24492]

> Upgrade stream-lib to 2.9.6
> ---
>
> Key: SPARK-27601
> URL: https://issues.apache.org/jira/browse/SPARK-27601
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> 1. Improve HyperLogLogPlus.merge and HyperLogLogPlus.mergeEstimators by using 
> native arrays instead of ArrayLists to avoid boxing and unboxing of integers. 
> https://github.com/addthis/stream-lib/pull/98
> 2. Improve HyperLogLogPlus.sortEncodedSet by using Arrays.sort on 
> appropriately transformed encoded values; in this way, boxing and unboxing of 
> integers is avoided. https://github.com/addthis/stream-lib/pull/97






[jira] [Commented] (SPARK-27598) DStreams checkpointing does not work with the Spark Shell

2019-05-02 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831861#comment-16831861
 ] 

Gabor Somogyi commented on SPARK-27598:
---

The shell works differently somehow. I've seen problems with Avro round-trip 
conversion as well.

> DStreams checkpointing does not work with the Spark Shell
> -
>
> Key: SPARK-27598
> URL: https://issues.apache.org/jira/browse/SPARK-27598
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> When I restarted a stream with checkpointing enabled I got this:
> {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from 
> file file:/tmp/checkpoint/checkpoint-155656695.bk
>  java.io.IOException: java.lang.ClassCastException: cannot assign instance of 
> java.lang.invoke.SerializedLambda to field 
> org.apache.spark.streaming.dstream.FileInputDStream.filter of type 
> scala.Function1 in instance of 
> org.apache.spark.streaming.dstream.FileInputDStream
>  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {quote}
> It seems that the closure is stored in serialized form and cannot be 
> assigned back to a Scala Function1.
> Details of how to reproduce it are here: 
> [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6]
> Maybe this is spark-shell specific and is not expected to work anyway, as I 
> don't see this being an issue with a normal jar. 
> Note that this still does not work with Spark 2.3.3, but the error is 
> different.






[jira] [Assigned] (SPARK-27624) Fix CalenderInterval to show an empty interval correctly

2019-05-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27624:


Assignee: Apache Spark

> Fix CalenderInterval to show an empty interval correctly
> 
>
> Key: SPARK-27624
> URL: https://issues.apache.org/jira/browse/SPARK-27624
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 2.4.2, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> *BEFORE*
> {code}
> scala> spark.readStream.schema("ts 
> timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain
> == Physical Plan ==
> EventTimeWatermark ts#0: timestamp, interval 1 microseconds
> +- StreamingRelation FileSource[/tmp/t], [ts#0]
> scala> spark.readStream.schema("ts 
> timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain
> == Physical Plan ==
> EventTimeWatermark ts#3: timestamp, interval
> +- StreamingRelation FileSource[/tmp/t], [ts#3]
> {code}
> *AFTER*
> {code}
> scala> spark.readStream.schema("ts 
> timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain
> == Physical Plan ==
> EventTimeWatermark ts#0: timestamp, interval 1 microseconds
> +- StreamingRelation FileSource[/tmp/t], [ts#0]
> scala> spark.readStream.schema("ts 
> timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain
> == Physical Plan ==
> EventTimeWatermark ts#3: timestamp, interval 0 microseconds
> +- StreamingRelation FileSource[/tmp/t], [ts#3]
> {code}






[jira] [Assigned] (SPARK-27624) Fix CalenderInterval to show an empty interval correctly

2019-05-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27624:


Assignee: (was: Apache Spark)

> Fix CalenderInterval to show an empty interval correctly
> 
>
> Key: SPARK-27624
> URL: https://issues.apache.org/jira/browse/SPARK-27624
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 2.4.2, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> *BEFORE*
> {code}
> scala> spark.readStream.schema("ts 
> timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain
> == Physical Plan ==
> EventTimeWatermark ts#0: timestamp, interval 1 microseconds
> +- StreamingRelation FileSource[/tmp/t], [ts#0]
> scala> spark.readStream.schema("ts 
> timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain
> == Physical Plan ==
> EventTimeWatermark ts#3: timestamp, interval
> +- StreamingRelation FileSource[/tmp/t], [ts#3]
> {code}
> *AFTER*
> {code}
> scala> spark.readStream.schema("ts 
> timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain
> == Physical Plan ==
> EventTimeWatermark ts#0: timestamp, interval 1 microseconds
> +- StreamingRelation FileSource[/tmp/t], [ts#0]
> scala> spark.readStream.schema("ts 
> timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain
> == Physical Plan ==
> EventTimeWatermark ts#3: timestamp, interval 0 microseconds
> +- StreamingRelation FileSource[/tmp/t], [ts#3]
> {code}






[jira] [Created] (SPARK-27624) Fix CalenderInterval to show an empty interval correctly

2019-05-02 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-27624:
-

 Summary: Fix CalenderInterval to show an empty interval correctly
 Key: SPARK-27624
 URL: https://issues.apache.org/jira/browse/SPARK-27624
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.2, 2.3.3, 3.0.0
Reporter: Dongjoon Hyun


*BEFORE*
{code}
scala> spark.readStream.schema("ts 
timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain
== Physical Plan ==
EventTimeWatermark ts#0: timestamp, interval 1 microseconds
+- StreamingRelation FileSource[/tmp/t], [ts#0]

scala> spark.readStream.schema("ts 
timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain
== Physical Plan ==
EventTimeWatermark ts#3: timestamp, interval
+- StreamingRelation FileSource[/tmp/t], [ts#3]
{code}

*AFTER*
{code}
scala> spark.readStream.schema("ts 
timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain
== Physical Plan ==
EventTimeWatermark ts#0: timestamp, interval 1 microseconds
+- StreamingRelation FileSource[/tmp/t], [ts#0]

scala> spark.readStream.schema("ts 
timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain
== Physical Plan ==
EventTimeWatermark ts#3: timestamp, interval 0 microseconds
+- StreamingRelation FileSource[/tmp/t], [ts#3]
{code}






[jira] [Assigned] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files

2019-05-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27194:


Assignee: Apache Spark

> Job failures when task attempts do not clean up spark-staging parquet files
> ---
>
> Key: SPARK-27194
> URL: https://issues.apache.org/jira/browse/SPARK-27194
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.1, 2.3.2, 2.3.3
>Reporter: Reza Safi
>Assignee: Apache Spark
>Priority: Major
>
> When a container fails for some reason (for example when killed by yarn for 
> exceeding memory limits), the subsequent task attempts for the tasks that 
> were running on that container all fail with a FileAlreadyExistsException. 
> The original task attempt does not seem to successfully call abortTask (or at 
> least its "best effort" delete is unsuccessful) and clean up the parquet file 
> it was writing to, so when later task attempts try to write to the same 
> spark-staging directory using the same file name, the job fails.
> Here is what transpires in the logs:
> The container where task 200.0 is running is killed and the task is lost:
> {code}
> 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on 
> t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 
> GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
>  19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 
> 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 
> exited caused by one of the running tasks) Reason: Container killed by YARN 
> for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
> boosting spark.yarn.executor.memoryOverhead.
> {code}
> The task is re-attempted on a different executor and fails because the 
> part-00200-blah-blah.c000.snappy.parquet file from the first task attempt 
> already exists:
> {code}
> 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 
> (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task 
> failed while writing rows.
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:109)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
>  Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
>  for client a.b.c.d already exists
> {code}
> The job fails when the configured task attempts (spark.task.maxFailures) 
> have failed with the same error:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 
> in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage 
> 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task 
> failed while writing rows.
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>  ...
>  Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
>  for client i.p.a.d already exists
> {code}
> SPARK-26682 wasn't the root cause here, since there wasn't any stage 
> reattempt.
> This issue seems to happen when 
> spark.sql.sources.partitionOverwriteMode=dynamic. 
>  
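A minimal sketch (with illustrative paths and values, not taken from the affected job) of the kind of dynamic-partition-overwrite write that goes through the {{.spark-staging}} directory described above:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("staging-sketch").getOrCreate()

// The failures above were observed under dynamic partition overwrite mode.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// Each task writes its part file into a hidden .spark-staging-<jobId> directory;
// if a killed task's file is not cleaned up, the re-attempt hits a
// FileAlreadyExistsException on the same file name.
spark.range(1000)
  .withColumn("dt", lit("2019-02-17"))
  .write
  .mode("overwrite")
  .partitionBy("dt")
  .parquet("/user/hive/warehouse/tmp_supply_feb1")
{code}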



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files

2019-05-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27194:


Assignee: (was: Apache Spark)

> Job failures when task attempts do not clean up spark-staging parquet files
> ---
>
> Key: SPARK-27194
> URL: https://issues.apache.org/jira/browse/SPARK-27194
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.1, 2.3.2, 2.3.3
>Reporter: Reza Safi
>Priority: Major
>
> When a container fails for some reason (for example when killed by yarn for 
> exceeding memory limits), the subsequent task attempts for the tasks that 
> were running on that container all fail with a FileAlreadyExistsException. 
> The original task attempt does not seem to successfully call abortTask (or at 
> least its "best effort" delete is unsuccessful) and clean up the parquet file 
> it was writing to, so when later task attempts try to write to the same 
> spark-staging directory using the same file name, the job fails.
> Here is what transpires in the logs:
> The container where task 200.0 is running is killed and the task is lost:
> {code}
> 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on 
> t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 
> GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
>  19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 
> 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 
> exited caused by one of the running tasks) Reason: Container killed by YARN 
> for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
> boosting spark.yarn.executor.memoryOverhead.
> {code}
> The task is re-attempted on a different executor and fails because the 
> part-00200-blah-blah.c000.snappy.parquet file from the first task attempt 
> already exists:
> {code}
> 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 
> (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task 
> failed while writing rows.
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:109)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
>  Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
>  for client a.b.c.d already exists
> {code}
> The job fails when the configured task attempts (spark.task.maxFailures) 
> have failed with the same error:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 
> in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage 
> 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task 
> failed while writing rows.
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>  ...
>  Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
>  for client i.p.a.d already exists
> {code}
> SPARK-26682 wasn't the root cause here, since there wasn't any stage 
> reattempt.
> This issue seems to happen when 
> spark.sql.sources.partitionOverwriteMode=dynamic. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files

2019-05-02 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-27194:
--

Assignee: (was: Marcelo Vanzin)

> Job failures when task attempts do not clean up spark-staging parquet files
> ---
>
> Key: SPARK-27194
> URL: https://issues.apache.org/jira/browse/SPARK-27194
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.1, 2.3.2, 2.3.3
>Reporter: Reza Safi
>Priority: Major
>
> When a container fails for some reason (for example when killed by yarn for 
> exceeding memory limits), the subsequent task attempts for the tasks that 
> were running on that container all fail with a FileAlreadyExistsException. 
> The original task attempt does not seem to successfully call abortTask (or at 
> least its "best effort" delete is unsuccessful) and clean up the parquet file 
> it was writing to, so when later task attempts try to write to the same 
> spark-staging directory using the same file name, the job fails.
> Here is what transpires in the logs:
> The container where task 200.0 is running is killed and the task is lost:
> {code}
> 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on 
> t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 
> GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
>  19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 
> 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 
> exited caused by one of the running tasks) Reason: Container killed by YARN 
> for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
> boosting spark.yarn.executor.memoryOverhead.
> {code}
> The task is re-attempted on a different executor and fails because the 
> part-00200-blah-blah.c000.snappy.parquet file from the first task attempt 
> already exists:
> {code}
> 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 
> (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task 
> failed while writing rows.
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:109)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
>  Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
>  for client a.b.c.d already exists
> {code}
> The job fails when the configured task attempts (spark.task.maxFailures) 
> have failed with the same error:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 
> in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage 
> 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task 
> failed while writing rows.
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>  ...
>  Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
>  for client i.p.a.d already exists
> {code}
> SPARK-26682 wasn't the root cause here, since there wasn't any stage 
> reattempt.
> This issue seems to happen when 
> spark.sql.sources.partitionOverwriteMode=dynamic. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files

2019-05-02 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-27194:
--

Assignee: Marcelo Vanzin

> Job failures when task attempts do not clean up spark-staging parquet files
> ---
>
> Key: SPARK-27194
> URL: https://issues.apache.org/jira/browse/SPARK-27194
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.1, 2.3.2, 2.3.3
>Reporter: Reza Safi
>Assignee: Marcelo Vanzin
>Priority: Major
>
> When a container fails for some reason (for example when killed by yarn for 
> exceeding memory limits), the subsequent task attempts for the tasks that 
> were running on that container all fail with a FileAlreadyExistsException. 
> The original task attempt does not seem to successfully call abortTask (or at 
> least its "best effort" delete is unsuccessful) and clean up the parquet file 
> it was writing to, so when later task attempts try to write to the same 
> spark-staging directory using the same file name, the job fails.
> Here is what transpires in the logs:
> The container where task 200.0 is running is killed and the task is lost:
> {code}
> 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on 
> t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 
> GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
>  19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 
> 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 
> exited caused by one of the running tasks) Reason: Container killed by YARN 
> for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
> boosting spark.yarn.executor.memoryOverhead.
> {code}
> The task is re-attempted on a different executor and fails because the 
> part-00200-blah-blah.c000.snappy.parquet file from the first task attempt 
> already exists:
> {code}
> 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 
> (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task 
> failed while writing rows.
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:109)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
>  Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
>  for client a.b.c.d already exists
> {code}
> The job fails when the configured task attempts (spark.task.maxFailures) 
> have failed with the same error:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 
> in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage 
> 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task 
> failed while writing rows.
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>  ...
>  Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
>  for client i.p.a.d already exists
> {code}
> SPARK-26682 wasn't the root cause here, since there wasn't any stage 
> reattempt.
> This issue seems to happen when 
> spark.sql.sources.partitionOverwriteMode=dynamic. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions

2019-05-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14083:


Assignee: (was: Apache Spark)

> Analyze JVM bytecode and turn closures into Catalyst expressions
> 
>
> Key: SPARK-14083
> URL: https://issues.apache.org/jira/browse/SPARK-14083
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Major
>
> One big advantage of the Dataset API is the type safety, at the cost of 
> performance due to heavy reliance on user-defined closures/lambdas. These 
> closures are typically slower than expressions because we have more 
> flexibility to optimize expressions (known data types, no virtual function 
> calls, etc). In many cases, it's actually not going to be very difficult to 
> look into the byte code of these closures and figure out what they are trying 
> to do. If we can understand them, then we can turn them directly into 
> Catalyst expressions for more optimized executions.
> Some examples are:
> {code}
> df.map(_.name)  // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), lit(18))
> df.map(_.id + 1)  // equivalent to Add(col("id"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for byte code analysis 
> and use that to convert closures/lambdas into Catalyst expressions in order 
> to speed up Dataset execution. It is a little bit futuristic, but I believe 
> it is very doable. The framework should be easy to reason about (e.g. similar 
> to Catalyst).
> Note that there is a big emphasis on "small" and "easy to reason about". A patch 
> should be rejected if it is too complicated or difficult to reason about.
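
As a rough illustration (the {{Person}} schema below is an assumed example, not part of the ticket), the closure form and the expression form it could be rewritten into look like this:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

// Assumed example schema, only for illustration.
case class Person(name: String, age: Long)

val spark = SparkSession.builder().master("local[*]").appName("closure-sketch").getOrCreate()
import spark.implicits._

val people = Seq(Person("a", 17), Person("b", 30)).toDS()

// Closure form: the lambda body is opaque bytecode to the optimizer.
val viaClosure = people.filter(_.age > 18)

// Expression form the proposed bytecode analysis could emit instead:
// GreaterThan(col("age"), lit(18)), which Catalyst can optimize and push down.
val viaExpression = people.filter(col("age") > lit(18))

viaClosure.show()
viaExpression.show()
{code}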



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions

2019-05-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14083:


Assignee: Apache Spark

> Analyze JVM bytecode and turn closures into Catalyst expressions
> 
>
> Key: SPARK-14083
> URL: https://issues.apache.org/jira/browse/SPARK-14083
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Major
>
> One big advantage of the Dataset API is the type safety, at the cost of 
> performance due to heavy reliance on user-defined closures/lambdas. These 
> closures are typically slower than expressions because we have more 
> flexibility to optimize expressions (known data types, no virtual function 
> calls, etc). In many cases, it's actually not going to be very difficult to 
> look into the byte code of these closures and figure out what they are trying 
> to do. If we can understand them, then we can turn them directly into 
> Catalyst expressions for more optimized executions.
> Some examples are:
> {code}
> df.map(_.name)  // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), lit(18))
> df.map(_.id + 1)  // equivalent to Add(col("id"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for byte code analysis 
> and use that to convert closures/lambdas into Catalyst expressions in order 
> to speed up Dataset execution. It is a little bit futuristic, but I believe 
> it is very doable. The framework should be easy to reason about (e.g. similar 
> to Catalyst).
> Note that there is a big emphasis on "small" and "easy to reason about". A patch 
> should be rejected if it is too complicated or difficult to reason about.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27619) MapType should be prohibited in hash expressions

2019-05-02 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-27619:
---
Description: 
Spark currently allows MapType expressions to be used as input to hash 
expressions, but I think that this should be prohibited because Spark SQL does 
not support map equality.

Currently, Spark SQL's map hashcodes are sensitive to the insertion order of 
map elements:
{code:java}
val a = spark.createDataset(Map(1->1, 2->2) :: Nil)
val b = spark.createDataset(Map(2->2, 1->1) :: Nil)

// Demonstration of how Scala Map equality is unaffected by insertion order:
assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode())
assert(Map(1->1, 2->2) == Map(2->2, 1->1))
assert(a.first() == b.first())

// In contrast, this will print two different hashcodes:
println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code}
This behavior might be surprising to Scala developers.

I think there's precedent for banning the use of MapType here because we 
already prohibit MapType in aggregation / joins / equality comparisons 
(SPARK-9415) and set operations (SPARK-19893).

If we decide that we want this to be an error then it might also be a good idea 
to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the old and 
buggy behavior (in case applications were relying on it in cases where it just 
so happens to be safe-by-accident (e.g. maps which only have one entry)).

Alternatively, we could support hashing here if we implemented support for 
comparable map types (SPARK-18134).

  was:
Spark currently allows MapType expressions to be used as input to hash 
expressions, but I think that this should be prohibited because Spark SQL does 
not support map equality.

Currently, Spark SQL's map hashcodes are sensitive to the insertion order of 
map elements:
{code:java}
val a = spark.createDataset(Map(1->1, 2->2) :: Nil)
val b = spark.createDataset(Map(2->2, 1->1) :: Nil)

# Demonstration of how Scala Map equality is unaffected by insertion order:
assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode())
assert(Map(1->1, 2->2) == Map(2->2, 1->1))
assert(a.first() == b.first())

# In contrast, this will print two different hashcodes:
println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code}
This behavior might be surprising to Scala developers.

I think there's precedent for banning the use of MapType here because we 
already prohibit MapType in aggregation / joins / equality comparisons 
(SPARK-9415) and set operations (SPARK-19893).

If we decide that we want this to be an error then it might also be a good idea 
to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the old and 
buggy behavior (in case applications were relying on it in cases where it just 
so happens to be safe-by-accident (e.g. maps which only have one entry)).

Alternatively, we could support hashing here if we implemented support for 
comparable map types (SPARK-18134).


> MapType should be prohibited in hash expressions
> 
>
> Key: SPARK-27619
> URL: https://issues.apache.org/jira/browse/SPARK-27619
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>  Labels: correctness
>
> Spark currently allows MapType expressions to be used as input to hash 
> expressions, but I think that this should be prohibited because Spark SQL 
> does not support map equality.
> Currently, Spark SQL's map hashcodes are sensitive to the insertion order of 
> map elements:
> {code:java}
> val a = spark.createDataset(Map(1->1, 2->2) :: Nil)
> val b = spark.createDataset(Map(2->2, 1->1) :: Nil)
> // Demonstration of how Scala Map equality is unaffected by insertion order:
> assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode())
> assert(Map(1->1, 2->2) == Map(2->2, 1->1))
> assert(a.first() == b.first())
> // In contrast, this will print two different hashcodes:
> println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code}
> This behavior might be surprising to Scala developers.
> I think there's precedent for banning the use of MapType here because we 
> already prohibit MapType in aggregation / joins / equality comparisons 
> (SPARK-9415) and set operations (SPARK-19893).
> If we decide that we want this to be an error then it might also be a good 
> idea to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the 
> old and buggy behavior (in case applications were relying on it in cases 
> where it just so happens to be safe-by-accident (e.g. maps which only have 
> one entry)).
> Alternatively, we could support hashing here if we implemented support for 
> comparable map types (SPARK-18134).
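
Until one of the options above lands, a possible workaround sketch (an assumption, not an endorsed API) is to canonicalize the map into a key-sorted array of entries before hashing; {{map_entries}} and {{sort_array}} are available from Spark 2.4:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, hash, map_entries, sort_array}

val spark = SparkSession.builder().master("local[*]").appName("map-hash-sketch").getOrCreate()
import spark.implicits._

val a = spark.createDataset(Map(1 -> 1, 2 -> 2) :: Nil).toDF("m")
val b = spark.createDataset(Map(2 -> 2, 1 -> 1) :: Nil).toDF("m")

// map_entries turns the map into an array of (key, value) structs in storage order;
// sort_array then orders those structs by key, so the hash no longer depends on
// insertion order. Both rows below should hash to the same value.
println(a.select(hash(sort_array(map_entries(col("m"))))).first())
println(b.select(hash(sort_array(map_entries(col("m"))))).first())
{code}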



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SPARK-27619) MapType should be prohibited in hash expressions

2019-05-02 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-27619:
---
Description: 
Spark currently allows MapType expressions to be used as input to hash 
expressions, but I think that this should be prohibited because Spark SQL does 
not support map equality.

Currently, Spark SQL's map hashcodes are sensitive to the insertion order of 
map elements:
{code:java}
val a = spark.createDataset(Map(1->1, 2->2) :: Nil)
val b = spark.createDataset(Map(2->2, 1->1) :: Nil)

# Demonstration of how Scala Map equality is unaffected by insertion order:
assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode())
assert(Map(1->1, 2->2) == Map(2->2, 1->1))
assert(a.first() == b.first())

# In contrast, this will print two different hashcodes:
println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code}
This behavior might be surprising to Scala developers.

I think there's precedent for banning the use of MapType here because we 
already prohibit MapType in aggregation / joins / equality comparisons 
(SPARK-9415) and set operations (SPARK-19893).

If we decide that we want this to be an error then it might also be a good idea 
to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the old and 
buggy behavior (in case applications were relying on it in cases where it just 
so happens to be safe-by-accident (e.g. maps which only have one entry)).

Alternatively, we could support hashing here if we implemented support for 
comparable map types (SPARK-18134).

  was:
Spark currently allows MapType expressions to be used as input to hash 
expressions, but I think that this should be prohibited because Spark SQL does 
not support map equality.

Currently, Spark SQL's map hashcodes are sensitive to the insertion order of 
map elements:
{code:java}
val a = spark.createDataset(Map(1->1, 2->2) :: Nil)
val b = spark.createDataset(Map(2->2, 1->1) :: Nil)

# Demonstration of how Scala Map equality is unaffected by insertion order:
assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode())
assert(Map(1->1, 2->2) == Map(2->2, 1->1))
assert(a.first() == b.first())

# In contrast, this will print two different hashcodes:
println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code}
This behavior might be surprising to Scala developers.

I think there's precedent for banning the use of MapType here because we 
already prohibit MapType in aggregation / joins / equality comparisons 
(SPARK-9415) and set operations (SPARK-19893).

Alternatively, we could support hashing here if we implemented support for 
comparable map types (SPARK-18134).


> MapType should be prohibited in hash expressions
> 
>
> Key: SPARK-27619
> URL: https://issues.apache.org/jira/browse/SPARK-27619
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>  Labels: correctness
>
> Spark currently allows MapType expressions to be used as input to hash 
> expressions, but I think that this should be prohibited because Spark SQL 
> does not support map equality.
> Currently, Spark SQL's map hashcodes are sensitive to the insertion order of 
> map elements:
> {code:java}
> val a = spark.createDataset(Map(1->1, 2->2) :: Nil)
> val b = spark.createDataset(Map(2->2, 1->1) :: Nil)
> # Demonstration of how Scala Map equality is unaffected by insertion order:
> assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode())
> assert(Map(1->1, 2->2) == Map(2->2, 1->1))
> assert(a.first() == b.first())
> # In contrast, this will print two different hashcodes:
> println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code}
> This behavior might be surprising to Scala developers.
> I think there's precedent for banning the use of MapType here because we 
> already prohibit MapType in aggregation / joins / equality comparisons 
> (SPARK-9415) and set operations (SPARK-19893).
> If we decide that we want this to be an error then it might also be a good 
> idea to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the 
> old and buggy behavior (in case applications were relying on it in cases 
> where it just so happens to be safe-by-accident (e.g. maps which only have 
> one entry)).
> Alternatively, we could support hashing here if we implemented support for 
> comparable map types (SPARK-18134).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27623) Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated

2019-05-02 Thread Alexandru Barbulescu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831684#comment-16831684
 ] 

Alexandru Barbulescu commented on SPARK-27623:
--

The problem might be related to the fact that in Spark 2.4.2 the pre-built 
convenience binaries are compiled for Scala 2.12, while the 
spark-cassandra-connector that I also included currently only supports Scala 2.11. 
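
A quick way to check for such a mismatch (a sketch; the exact version string shown is an assumption) is to print the Scala version of the running Spark JVM from spark-shell and compare it with the {{_2.11}}/{{_2.12}} suffix of every extra package on the classpath:

{code:java}
// Prints e.g. "2.12.8" on the Spark 2.4.2 convenience binaries; every
// third-party Spark package pulled onto the classpath must use the matching
// _2.12 (or _2.11) artifact suffix.
println(scala.util.Properties.versionNumberString)
{code}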

> Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
> ---
>
> Key: SPARK-27623
> URL: https://issues.apache.org/jira/browse/SPARK-27623
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.2
>Reporter: Alexandru Barbulescu
>Priority: Critical
>
> After updating to Spark 2.4.2, when using the 
> {code:java}
> spark.read.format().options().load()
> {code}
>  
> chain of methods, regardless of what parameter is passed to "format", we get 
> the following error related to Avro:
>  
> {code:java}
> - .options(**load_options)
> - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 
> 172, in load
> - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
> 1257, in __call__
> - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in 
> deco
> - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> - py4j.protocol.Py4JJavaError: An error occurred while calling o69.load.
> - : java.util.ServiceConfigurationError: 
> org.apache.spark.sql.sources.DataSourceRegister: Provider 
> org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
> - at java.util.ServiceLoader.fail(ServiceLoader.java:232)
> - at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
> - at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
> - at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
> - at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
> - at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
> - at scala.collection.Iterator.foreach(Iterator.scala:941)
> - at scala.collection.Iterator.foreach$(Iterator.scala:941)
> - at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> - at scala.collection.IterableLike.foreach(IterableLike.scala:74)
> - at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> - at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> - at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:250)
> - at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:248)
> - at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
> - at scala.collection.TraversableLike.filter(TraversableLike.scala:262)
> - at scala.collection.TraversableLike.filter$(TraversableLike.scala:262)
> - at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
> - at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
> - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
> - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
> - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> - at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> - at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> - at java.lang.reflect.Method.invoke(Method.java:498)
> - at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> - at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> - at py4j.Gateway.invoke(Gateway.java:282)
> - at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> - at py4j.commands.CallCommand.execute(CallCommand.java:79)
> - at py4j.GatewayConnection.run(GatewayConnection.java:238)
> - at java.lang.Thread.run(Thread.java:748)
> - Caused by: java.lang.NoClassDefFoundError: 
> org/apache/spark/sql/execution/datasources/FileFormat$class
> - at org.apache.spark.sql.avro.AvroFileFormat.<init>(AvroFileFormat.scala:44)
> - at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> - at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> - at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> - at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> - at java.lang.Class.newInstance(Class.java:442)
> - at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
> - ... 29 more
> - Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.sql.execution.datasources.FileFormat$class
> - at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> - 

[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None

2019-05-02 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831680#comment-16831680
 ] 

Liang-Chi Hsieh commented on SPARK-27612:
-

Yeah, it seems the issue happens when the Python object gets pickled...

> Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays 
> of None
> -
>
> Key: SPARK-27612
> URL: https://issues.apache.org/jira/browse/SPARK-27612
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Critical
>  Labels: correctness
>
> This seems to only affect Python 3.
> When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there 
> ends up being rows that are filled with None.
>  
> {code:java}
> In [1]: from pyspark.sql.types import ArrayType, IntegerType  
>    
> In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, 
> ArrayType(IntegerType(), True))     
> In [3]: df.distinct().collect()   
>    
> Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
> {code}
>  
> From this example, it is consistently at elements 97, 98:
> {code}
> In [5]: df.collect()[-5:] 
>    
> Out[5]: 
> [Row(value=[1, 2, 3, 4]),
>  Row(value=[1, 2, 3, 4]),
>  Row(value=[None, None, None, None]),
>  Row(value=[None, None, None, None]),
>  Row(value=[1, 2, 3, 4])]
> {code}
> This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27607) Improve performance of Row.toString()

2019-05-02 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27607.
---
   Resolution: Fixed
 Assignee: Marco Gaido
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24505

> Improve performance of Row.toString()
> -
>
> Key: SPARK-27607
> URL: https://issues.apache.org/jira/browse/SPARK-27607
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 3.0.0
>
>
> I have a job which ends up calling {{org.apache.spark.sql.Row.toString}} on 
> every row in a massive dataset (the reasons for this are slightly odd and 
> it's a bit non-trivial to change the job to avoid this step). 
> {{Row.toString}} is implemented by first constructing a WrappedArray 
> containing the Row's values (by calling {{toSeq}}) and then turning that 
> array into a string with {{mkString}}. We might be able to get a small 
> performance win by pipelining these steps, using an imperative loop to append 
> fields to a StringBuilder as soon as they're retrieved (thereby cutting out a 
> few layers of Scala collections indirection).
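
A rough sketch of the pipelined approach described above (an illustration, not the merged implementation), appending each field directly instead of materializing an intermediate {{toSeq}}:

{code:java}
import org.apache.spark.sql.Row

// Builds the same "[a,b,c]"-style string as Row.toString, but in a single
// imperative pass over the row's fields.
def rowToString(row: Row): String = {
  val sb = new StringBuilder("[")
  var i = 0
  while (i < row.length) {
    if (i > 0) sb.append(',')
    sb.append(row.get(i)) // StringBuilder.append(Any) renders nulls as "null"
    i += 1
  }
  sb.append(']').toString
}
{code}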



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27623) Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated

2019-05-02 Thread Alexandru Barbulescu (JIRA)
Alexandru Barbulescu created SPARK-27623:


 Summary: Provider org.apache.spark.sql.avro.AvroFileFormat could 
not be instantiated
 Key: SPARK-27623
 URL: https://issues.apache.org/jira/browse/SPARK-27623
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.2
Reporter: Alexandru Barbulescu


After updating to Spark 2.4.2, when using the 
{code:java}
spark.read.format().options().load()
{code}
 

chain of methods, regardless of what parameter is passed to "format", we get the 
following error related to Avro:

 
{code:java}
- .options(**load_options)
- File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 172, 
in load
- File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
1257, in __call__
- File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in 
deco
- File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, 
in get_return_value
- py4j.protocol.Py4JJavaError: An error occurred while calling o69.load.
- : java.util.ServiceConfigurationError: 
org.apache.spark.sql.sources.DataSourceRegister: Provider 
org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
- at java.util.ServiceLoader.fail(ServiceLoader.java:232)
- at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
- at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
- at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
- at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
- at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
- at scala.collection.Iterator.foreach(Iterator.scala:941)
- at scala.collection.Iterator.foreach$(Iterator.scala:941)
- at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
- at scala.collection.IterableLike.foreach(IterableLike.scala:74)
- at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
- at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
- at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:250)
- at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:248)
- at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
- at scala.collection.TraversableLike.filter(TraversableLike.scala:262)
- at scala.collection.TraversableLike.filter$(TraversableLike.scala:262)
- at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
- at 
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
- at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
- at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
- at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
- at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
- at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
- at java.lang.reflect.Method.invoke(Method.java:498)
- at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
- at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
- at py4j.Gateway.invoke(Gateway.java:282)
- at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
- at py4j.commands.CallCommand.execute(CallCommand.java:79)
- at py4j.GatewayConnection.run(GatewayConnection.java:238)
- at java.lang.Thread.run(Thread.java:748)
- Caused by: java.lang.NoClassDefFoundError: 
org/apache/spark/sql/execution/datasources/FileFormat$class
- at org.apache.spark.sql.avro.AvroFileFormat.<init>(AvroFileFormat.scala:44)
- at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
- at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
- at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
- at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
- at java.lang.Class.newInstance(Class.java:442)
- at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
- ... 29 more
- Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.sql.execution.datasources.FileFormat$class
- at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
- at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
- at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
- at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
- ... 36 more

{code}
 

The code we run looks like this:

 
{code:java}
spark_session = (
 SparkSession.builder
 .appName(APPLICATION_NAME)
 .master(MASTER_URL)
 .config('spark.cassandra.connection.host', SERVER_IP_ADDRESS)
 .config('spark.cassandra.auth.username', CASSANDRA_USERNAME)
 .config('spark.cassandra.auth.password', CASSANDRA_PASSWORD)
 .config('spark.sql.shuffle.partitions', 16)
 .config('parquet.enable.summary-metadata', 'true')
 

[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None

2019-05-02 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831623#comment-16831623
 ] 

Hyukjin Kwon commented on SPARK-27612:
--

Argh, this happens after we upgraded cloudpickle to 0.6.2 
https://github.com/apache/spark/commit/75ea89ad94ca76646e4697cf98c78d14c6e2695f#diff-19fd865e0dd0d7e6b04b3b1e047dcda7
Upgrading cloudpickle to 0.8.1 still doesn't solve the problem. I think we 
should fix it in cloudpickle, have a cloudpickle release made, and then port that 
change into Spark.

> Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays 
> of None
> -
>
> Key: SPARK-27612
> URL: https://issues.apache.org/jira/browse/SPARK-27612
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Critical
>  Labels: correctness
>
> This seems to only affect Python 3.
> When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there 
> ends up being rows that are filled with None.
>  
> {code:java}
> In [1]: from pyspark.sql.types import ArrayType, IntegerType  
>    
> In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, 
> ArrayType(IntegerType(), True))     
> In [3]: df.distinct().collect()   
>    
> Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
> {code}
>  
> From this example, it is consistently at elements 97, 98:
> {code}
> In [5]: df.collect()[-5:] 
>    
> Out[5]: 
> [Row(value=[1, 2, 3, 4]),
>  Row(value=[1, 2, 3, 4]),
>  Row(value=[None, None, None, None]),
>  Row(value=[None, None, None, None]),
>  Row(value=[1, 2, 3, 4])]
> {code}
> This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None

2019-05-02 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831602#comment-16831602
 ] 

Hyukjin Kwon commented on SPARK-27612:
--

Argh, seems to be a regression.

{code}
>>> from pyspark.sql.types import ArrayType, IntegerType
>>> df = spark.createDataFrame([[1, 2, 3, 4]] * 100, ArrayType(IntegerType(), True))
>>> df.distinct().collect()

[Row(value=[1, 2, 3, 4])]
{code}

Doesn't happen in Spark 2.4.1 and Spark 2.3.3

> Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays 
> of None
> -
>
> Key: SPARK-27612
> URL: https://issues.apache.org/jira/browse/SPARK-27612
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> This seems to only affect Python 3.
> When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there 
> ends up being rows that are filled with None.
>  
> {code:java}
> In [1]: from pyspark.sql.types import ArrayType, IntegerType  
>    
> In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, 
> ArrayType(IntegerType(), True))     
> In [3]: df.distinct().collect()   
>    
> Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
> {code}
>  
> From this example, it is consistently at elements 97, 98:
> {code}
> In [5]: df.collect()[-5:] 
>    
> Out[5]: 
> [Row(value=[1, 2, 3, 4]),
>  Row(value=[1, 2, 3, 4]),
>  Row(value=[None, None, None, None]),
>  Row(value=[None, None, None, None]),
>  Row(value=[1, 2, 3, 4])]
> {code}
> This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23098) Migrate Kafka batch source to v2

2019-05-02 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831601#comment-16831601
 ] 

Gabor Somogyi commented on SPARK-23098:
---

[~DylanGuedes] are you working on this?


> Migrate Kafka batch source to v2
> 
>
> Key: SPARK-23098
> URL: https://issues.apache.org/jira/browse/SPARK-23098
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None

2019-05-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27612:
-
Labels: correctness  (was: )

> Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays 
> of None
> -
>
> Key: SPARK-27612
> URL: https://issues.apache.org/jira/browse/SPARK-27612
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Critical
>  Labels: correctness
>
> This seems to only affect Python 3.
> When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there 
> ends up being rows that are filled with None.
>  
> {code:java}
> In [1]: from pyspark.sql.types import ArrayType, IntegerType  
>    
> In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, 
> ArrayType(IntegerType(), True))     
> In [3]: df.distinct().collect()   
>    
> Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
> {code}
>  
> From this example, it is consistently at elements 97, 98:
> {code}
> In [5]: df.collect()[-5:] 
>    
> Out[5]: 
> [Row(value=[1, 2, 3, 4]),
>  Row(value=[1, 2, 3, 4]),
>  Row(value=[None, None, None, None]),
>  Row(value=[None, None, None, None]),
>  Row(value=[1, 2, 3, 4])]
> {code}
> This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None

2019-05-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27612:
-
Priority: Critical  (was: Major)

> Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays 
> of None
> -
>
> Key: SPARK-27612
> URL: https://issues.apache.org/jira/browse/SPARK-27612
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Critical
>
> This seems to only affect Python 3.
> When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there 
> ends up being rows that are filled with None.
>  
> {code:java}
> In [1]: from pyspark.sql.types import ArrayType, IntegerType  
>    
> In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, 
> ArrayType(IntegerType(), True))     
> In [3]: df.distinct().collect()   
>    
> Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
> {code}
>  
> From this example, it is consistently at elements 97, 98:
> {code}
> In [5]: df.collect()[-5:] 
>    
> Out[5]: 
> [Row(value=[1, 2, 3, 4]),
>  Row(value=[1, 2, 3, 4]),
>  Row(value=[None, None, None, None]),
>  Row(value=[None, None, None, None]),
>  Row(value=[1, 2, 3, 4])]
> {code}
> This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None

2019-05-02 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831600#comment-16831600
 ] 

Liang-Chi Hsieh commented on SPARK-27612:
-

Yup, I can reproduce it too. No worries, [~bryanc]. :)
It will take some time to look into it.



> Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays 
> of None
> -
>
> Key: SPARK-27612
> URL: https://issues.apache.org/jira/browse/SPARK-27612
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> This seems to only affect Python 3.
> When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there 
> ends up being rows that are filled with None.
>  
> {code:java}
> In [1]: from pyspark.sql.types import ArrayType, IntegerType  
>    
> In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, 
> ArrayType(IntegerType(), True))     
> In [3]: df.distinct().collect()   
>    
> Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
> {code}
>  
> From this example, it is consistently at elements 97, 98:
> {code}
> In [5]: df.collect()[-5:] 
>    
> Out[5]: 
> [Row(value=[1, 2, 3, 4]),
>  Row(value=[1, 2, 3, 4]),
>  Row(value=[None, None, None, None]),
>  Row(value=[None, None, None, None]),
>  Row(value=[1, 2, 3, 4])]
> {code}
> This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None

2019-05-02 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831596#comment-16831596
 ] 

Hyukjin Kwon commented on SPARK-27612:
--

haha, you're not crazy

{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
      /_/

Using Python version 3.7.3 (default, Mar 27 2019 09:23:15)
SparkSession available as 'spark'.
>>> from pyspark.sql.types import ArrayType, IntegerType
>>> df = spark.createDataFrame([[1, 2, 3, 4]] * 100, ArrayType(IntegerType(), True))
>>> df.distinct().collect()
[Row(value=[None, None]), Row(value=[1, 2, 3, 4])]
{code}

> Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays 
> of None
> -
>
> Key: SPARK-27612
> URL: https://issues.apache.org/jira/browse/SPARK-27612
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> This seems to only affect Python 3.
> When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there 
> ends up being rows that are filled with None.
>  
> {code:java}
> In [1]: from pyspark.sql.types import ArrayType, IntegerType  
>    
> In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, 
> ArrayType(IntegerType(), True))     
> In [3]: df.distinct().collect()   
>    
> Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
> {code}
>  
> From this example, it is consistently at elements 97, 98:
> {code}
> In [5]: df.collect()[-5:] 
>    
> Out[5]: 
> [Row(value=[1, 2, 3, 4]),
>  Row(value=[1, 2, 3, 4]),
>  Row(value=[None, None, None, None]),
>  Row(value=[None, None, None, None]),
>  Row(value=[1, 2, 3, 4])]
> {code}
> This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26759) Arrow optimization in SparkR's interoperability

2019-05-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26759.
--
Resolution: Done

> Arrow optimization in SparkR's interoperability
> ---
>
> Key: SPARK-26759
> URL: https://issues.apache.org/jira/browse/SPARK-26759
> Project: Spark
>  Issue Type: Umbrella
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Arrow 0.12.0 is released and it contains an R API. We could optimize Spark 
> DataFrame <> R DataFrame interoperability.
> For instance see the examples below:
>  - {{dapply}}
> {code:java}
> df <- createDataFrame(mtcars)
> collect(dapply(df,
>function(r.data.frame) {
>  data.frame(r.data.frame$gear)
>},
>structType("gear long")))
> {code}
>  - {{gapply}}
> {code:java}
> df <- createDataFrame(mtcars)
> collect(gapply(df,
>"gear",
>function(key, group) {
>  data.frame(gear = key[[1]], disp = mean(group$disp) > 
> group$disp)
>},
>structType("gear double, disp boolean")))
> {code}
>  - R DataFrame -> Spark DataFrame
> {code:java}
> createDataFrame(mtcars)
> {code}
>  - Spark DataFrame -> R DataFrame
> {code:java}
> collect(df)
> head(df)
> {code}
> Currently, some of the communication paths between the R side and the JVM side 
> have to buffer the data and flush it at once due to ARROW-4512. I don't aim to 
> fix that under this umbrella.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26759) Arrow optimization in SparkR's interoperability

2019-05-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26759:
-
Fix Version/s: 3.0.0

> Arrow optimization in SparkR's interoperability
> ---
>
> Key: SPARK-26759
> URL: https://issues.apache.org/jira/browse/SPARK-26759
> Project: Spark
>  Issue Type: Umbrella
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Arrow 0.12.0 is released and it contains an R API. We could optimize Spark 
> DataFrame <> R DataFrame interoperability.
> For instance, see the examples below:
>  - {{dapply}}
> {code:java}
> df <- createDataFrame(mtcars)
> collect(dapply(df,
>function(r.data.frame) {
>  data.frame(r.data.frame$gear)
>},
>structType("gear long")))
> {code}
>  - {{gapply}}
> {code:java}
> df <- createDataFrame(mtcars)
> collect(gapply(df,
>"gear",
>function(key, group) {
>  data.frame(gear = key[[1]], disp = mean(group$disp) > 
> group$disp)
>},
>structType("gear double, disp boolean")))
> {code}
>  - R DataFrame -> Spark DataFrame
> {code:java}
> createDataFrame(mtcars)
> {code}
>  - Spark DataFrame -> R DataFrame
> {code:java}
> collect(df)
> head(df)
> {code}
> Currently, some of the communication paths between the R side and the JVM 
> side have to buffer the data and flush it at once due to ARROW-4512. I don't 
> target fixing that under this umbrella.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26924) Fix CRAN hack as soon as Arrow is available on CRAN

2019-05-02 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831594#comment-16831594
 ] 

Hyukjin Kwon commented on SPARK-26924:
--

I am going to make this a standalone issue. I don't think we can fix it 
within 3.0.0.

> Fix CRAN hack as soon as Arrow is available on CRAN
> ---
>
> Key: SPARK-26924
> URL: https://issues.apache.org/jira/browse/SPARK-26924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Arrow optimization was added, but Arrow is not available on CRAN.
> So, we had to add some hacks to avoid the CRAN check on the SparkR side. For 
> example, see 
> https://github.com/apache/spark/search?q=requireNamespace1_q=requireNamespace1
> These should be removed so that the CRAN check runs properly for SparkR.
> See also ARROW-3204.
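To make the nature of the hack concrete, here is a rough Python analogue of the optional-dependency pattern (not the actual SparkR code, which does the equivalent in R with requireNamespace(); the names below are made up for illustration): probe for the package at runtime instead of declaring a hard dependency that the CRAN check would try to resolve.

{code:python}
# Hypothetical sketch of an optional-dependency probe; nothing here is taken
# from the Spark source.
def have_optional_package(name):
    try:
        __import__(name)        # probe at runtime instead of hard-depending
        return True
    except ImportError:
        return False

use_arrow_path = have_optional_package("pyarrow")
print("Arrow-accelerated path available:", use_arrow_path)
{code}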



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26924) Fix CRAN hack as soon as Arrow is available on CRAN

2019-05-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26924:
-
Issue Type: Bug  (was: Sub-task)
Parent: (was: SPARK-26759)

> Fix CRAN hack as soon as Arrow is available on CRAN
> ---
>
> Key: SPARK-26924
> URL: https://issues.apache.org/jira/browse/SPARK-26924
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Arrow optimization was added, but Arrow is not available on CRAN.
> So, we had to add some hacks to avoid the CRAN check on the SparkR side. For 
> example, see 
> https://github.com/apache/spark/search?q=requireNamespace1_q=requireNamespace1
> These should be removed so that the CRAN check runs properly for SparkR.
> See also ARROW-3204.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26921) Document Arrow optimization and vectorized R APIs

2019-05-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26921:
-
Summary: Document Arrow optimization and vectorized R APIs  (was: Fix CRAN 
hack as soon as Arrow is available on CRAN)

> Document Arrow optimization and vectorized R APIs
> -
>
> Key: SPARK-26921
> URL: https://issues.apache.org/jira/browse/SPARK-26921
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> We should update the SparkR guide documentation, and some related documents 
> and comments (like those in {{SQLConf.scala}}), when most of the tasks are finished.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26924) Fix CRAN hack as soon as Arrow is available on CRAN

2019-05-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26924:
-
Summary: Fix CRAN hack as soon as Arrow is available on CRAN  (was: 
Document Arrow optimization and vectorized R APIs)

> Fix CRAN hack as soon as Arrow is available on CRAN
> ---
>
> Key: SPARK-26924
> URL: https://issues.apache.org/jira/browse/SPARK-26924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Arrow optimization was added, but Arrow is not available on CRAN.
> So, we had to add some hacks to avoid the CRAN check on the SparkR side. For 
> example, see 
> https://github.com/apache/spark/search?q=requireNamespace1_q=requireNamespace1
> These should be removed so that the CRAN check runs properly for SparkR.
> See also ARROW-3204.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26921) Fix CRAN hack as soon as Arrow is available on CRAN

2019-05-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26921:
-
Description: 
We should update the SparkR guide documentation, and some related documents 
and comments (like those in {{SQLConf.scala}}), when most of the tasks are finished.


  was:
Arrow optimization was added, but Arrow is not available on CRAN.

So, we had to add some hacks to avoid the CRAN check on the SparkR side. For 
example, see 

https://github.com/apache/spark/search?q=requireNamespace1_q=requireNamespace1

These should be removed so that the CRAN check runs properly for SparkR.

See also ARROW-3204.



> Fix CRAN hack as soon as Arrow is available on CRAN
> ---
>
> Key: SPARK-26921
> URL: https://issues.apache.org/jira/browse/SPARK-26921
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> We should update the SparkR guide documentation, and some related documents 
> and comments (like those in {{SQLConf.scala}}), when most of the tasks are finished.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26924) Document Arrow optimization and vectorized R APIs

2019-05-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26924:
-
Description: 
Arrow optimization was added, but Arrow is not available on CRAN.

So, we had to add some hacks to avoid the CRAN check on the SparkR side. For 
example, see 

https://github.com/apache/spark/search?q=requireNamespace1_q=requireNamespace1

These should be removed so that the CRAN check runs properly for SparkR.

See also ARROW-3204.


  was:We should update the SparkR guide documentation, and some related documents 
and comments (like those in {{SQLConf.scala}}), when most of the tasks are finished.


> Document Arrow optimization and vectorized R APIs
> -
>
> Key: SPARK-26924
> URL: https://issues.apache.org/jira/browse/SPARK-26924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Arrow optimization was added, but Arrow is not available on CRAN.
> So, we had to add some hacks to avoid the CRAN check on the SparkR side. For 
> example, see 
> https://github.com/apache/spark/search?q=requireNamespace1_q=requireNamespace1
> These should be removed so that the CRAN check runs properly for SparkR.
> See also ARROW-3204.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27606) Deprecate `extended` field in ExpressionDescription/ExpressionInfo

2019-05-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27606.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24500
[https://github.com/apache/spark/pull/24500]

> Deprecate `extended` field in ExpressionDescription/ExpressionInfo
> --
>
> Key: SPARK-27606
> URL: https://issues.apache.org/jira/browse/SPARK-27606
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> As of SPARK-21485 and SPARK-27328, we have a nicer way to describe extended 
> usages separately.
> The `extended` field and method on ExpressionDescription/ExpressionInfo are 
> now pretty much useless.
> This JIRA targets deprecating them.
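For context, the extended documentation that these fields feed is what users see from SQL. A minimal check (assuming an active SparkSession named {{spark}}) is sketched below; per the description above, the structured usage/examples fields from SPARK-21485 and SPARK-27328 are what populate this output now, which is why the legacy `extended` field is redundant.

{code:python}
# Show the full (extended) documentation for a built-in function; the text
# comes from ExpressionDescription/ExpressionInfo on the JVM side.
spark.sql("DESCRIBE FUNCTION EXTENDED upper").show(truncate=False)
{code}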



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27606) Deprecate `extended` field in ExpressionDescription/ExpressionInfo

2019-05-02 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27606:


Assignee: Hyukjin Kwon

> Deprecate `extended` field in ExpressionDescription/ExpressionInfo
> --
>
> Key: SPARK-27606
> URL: https://issues.apache.org/jira/browse/SPARK-27606
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> As of SPARK-21485 and SPARK-27328, we have a nicer way to describe extended 
> usages separately.
> The `extended` field and method on ExpressionDescription/ExpressionInfo are 
> now pretty much useless.
> This JIRA targets deprecating them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27622) Avoiding network communication when block managers are running on the same host

2019-05-02 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831553#comment-16831553
 ] 

Attila Zsolt Piros commented on SPARK-27622:


I am already working on this. A prototype for RDD blocks is ready and 
working.

> Avoiding network communication when block managers are running on the same host 
> --
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27622) Avoiding network communication when block managers are running on the same host

2019-05-02 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-27622:
--

 Summary: Avoiding network communication when block managers are 
running on the same host 
 Key: SPARK-27622
 URL: https://issues.apache.org/jira/browse/SPARK-27622
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Affects Versions: 3.0.0
Reporter: Attila Zsolt Piros


Currently fetching blocks always uses the network even when the two block 
managers are running on the same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-27621) Calling transform() method on a LinearRegressionModel throws NoSuchElementException

2019-05-02 Thread Anca Sarb (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anca Sarb updated SPARK-27621:
--
Comment: was deleted

(was: I've created a PR with the fix here 
[https://github.com/apache/spark/pull/24509])

> Calling transform() method on a LinearRegressionModel throws 
> NoSuchElementException
> ---
>
> Key: SPARK-27621
> URL: https://issues.apache.org/jira/browse/SPARK-27621
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2
>Reporter: Anca Sarb
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When the transform(...) method is called on a LinearRegressionModel created 
> directly with the coefficients and intercept, the following exception is 
> encountered.
> {code:java}
> java.util.NoSuchElementException: Failed to find a default value for loss at 
> org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
>  at 
> org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
>  at scala.Option.getOrElse(Option.scala:121) at 
> org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779) at 
> org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42) at 
> org.apache.spark.ml.param.Params$class.$(params.scala:786) at 
> org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42) at 
> org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111)
>  at 
> org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637)
>  at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192) 
> at 
> org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
>  at 
> org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) at 
> org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311) at 
> org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at 
> org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305)
> {code}
> This is because validateAndTransformSchema() is called during both the training 
> and scoring phases, but the checks against training-related params like 
> loss should really be performed only during the training phase; please 
> correct me if I'm missing anything.
> This issue was first reported for mleap 
> ([combust/mleap#455|https://github.com/combust/mleap/issues/455]) because 
> basically when we serialize the Spark transformers for mleap, we only 
> serialize the params that are relevant for scoring. We do have the option to 
> de-serialize the serialized transformers back into Spark for scoring again, 
> but in that case, we no longer have all the training params.
> Test to reproduce in PR: [https://github.com/apache/spark/pull/24509]
>  
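To make the failure mode above more concrete, here is a small toy sketch in plain Python (not Spark's actual Params implementation; all names are made up for illustration): a param lookup succeeds only when the value was explicitly set or a default exists, and a model rebuilt from scoring-only params has neither for training params such as loss.

{code:python}
class ToyParams:
    # Toy stand-in for the default-value lookup described above; not Spark code.
    def __init__(self):
        self._set = {}       # params explicitly set by the user
        self._default = {}   # defaults normally filled in while fitting

    def get_or_default(self, name):
        if name in self._set:
            return self._set[name]
        if name in self._default:
            return self._default[name]
        raise KeyError("Failed to find a default value for " + name)

fitted = ToyParams()
fitted._default["loss"] = "squaredError"   # present after a normal fit()
print(fitted.get_or_default("loss"))       # ok

rebuilt = ToyParams()                      # rebuilt from scoring-only params
try:
    rebuilt.get_or_default("loss")
except KeyError as e:
    print(e)                               # mirrors the reported exception
{code}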



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27621) Calling transform() method on a LinearRegressionModel throws NoSuchElementException

2019-05-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831502#comment-16831502
 ] 

Apache Spark commented on SPARK-27621:
--

User 'ancasarb' has created a pull request for this issue:
https://github.com/apache/spark/pull/24509

> Calling transform() method on a LinearRegressionModel throws 
> NoSuchElementException
> ---
>
> Key: SPARK-27621
> URL: https://issues.apache.org/jira/browse/SPARK-27621
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2
>Reporter: Anca Sarb
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When the transform(...) method is called on a LinearRegressionModel created 
> directly with the coefficients and intercept, the following exception is 
> encountered.
> {code:java}
> java.util.NoSuchElementException: Failed to find a default value for loss at 
> org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
>  at 
> org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
>  at scala.Option.getOrElse(Option.scala:121) at 
> org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779) at 
> org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42) at 
> org.apache.spark.ml.param.Params$class.$(params.scala:786) at 
> org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42) at 
> org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111)
>  at 
> org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637)
>  at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192) 
> at 
> org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
>  at 
> org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) at 
> org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311) at 
> org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at 
> org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305)
> {code}
> This is because validateAndTransformSchema() is called during both the training 
> and scoring phases, but the checks against training-related params like 
> loss should really be performed only during the training phase; please 
> correct me if I'm missing anything.
> This issue was first reported for mleap 
> ([combust/mleap#455|https://github.com/combust/mleap/issues/455]) because 
> basically when we serialize the Spark transformers for mleap, we only 
> serialize the params that are relevant for scoring. We do have the option to 
> de-serialize the serialized transformers back into Spark for scoring again, 
> but in that case, we no longer have all the training params.
> Test to reproduce in PR: [https://github.com/apache/spark/pull/24509]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27621) Calling transform() method on a LinearRegressionModel throws NoSuchElementException

2019-05-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831501#comment-16831501
 ] 

Apache Spark commented on SPARK-27621:
--

User 'ancasarb' has created a pull request for this issue:
https://github.com/apache/spark/pull/24509

> Calling transform() method on a LinearRegressionModel throws 
> NoSuchElementException
> ---
>
> Key: SPARK-27621
> URL: https://issues.apache.org/jira/browse/SPARK-27621
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2
>Reporter: Anca Sarb
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When the transform(...) method is called on a LinearRegressionModel created 
> directly with the coefficients and intercept, the following exception is 
> encountered.
> {code:java}
> java.util.NoSuchElementException: Failed to find a default value for loss at 
> org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
>  at 
> org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
>  at scala.Option.getOrElse(Option.scala:121) at 
> org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779) at 
> org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42) at 
> org.apache.spark.ml.param.Params$class.$(params.scala:786) at 
> org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42) at 
> org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111)
>  at 
> org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637)
>  at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192) 
> at 
> org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
>  at 
> org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) at 
> org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311) at 
> org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at 
> org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305)
> {code}
> This is because validateAndTransformSchema() is called during both the training 
> and scoring phases, but the checks against training-related params like 
> loss should really be performed only during the training phase; please 
> correct me if I'm missing anything.
> This issue was first reported for mleap 
> ([combust/mleap#455|https://github.com/combust/mleap/issues/455]) because 
> basically when we serialize the Spark transformers for mleap, we only 
> serialize the params that are relevant for scoring. We do have the option to 
> de-serialize the serialized transformers back into Spark for scoring again, 
> but in that case, we no longer have all the training params.
> Test to reproduce in PR: [https://github.com/apache/spark/pull/24509]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27621) Calling transform() method on a LinearRegressionModel throws NoSuchElementException

2019-05-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27621:


Assignee: Apache Spark

> Calling transform() method on a LinearRegressionModel throws 
> NoSuchElementException
> ---
>
> Key: SPARK-27621
> URL: https://issues.apache.org/jira/browse/SPARK-27621
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2
>Reporter: Anca Sarb
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When the transform(...) method is called on a LinearRegressionModel created 
> directly with the coefficients and intercept, the following exception is 
> encountered.
> {code:java}
> java.util.NoSuchElementException: Failed to find a default value for loss at 
> org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
>  at 
> org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
>  at scala.Option.getOrElse(Option.scala:121) at 
> org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779) at 
> org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42) at 
> org.apache.spark.ml.param.Params$class.$(params.scala:786) at 
> org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42) at 
> org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111)
>  at 
> org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637)
>  at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192) 
> at 
> org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
>  at 
> org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) at 
> org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311) at 
> org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at 
> org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305)
> {code}
> This is because validateAndTransformSchema() is called during both the training 
> and scoring phases, but the checks against training-related params like 
> loss should really be performed only during the training phase; please 
> correct me if I'm missing anything.
> This issue was first reported for mleap 
> ([combust/mleap#455|https://github.com/combust/mleap/issues/455]) because 
> basically when we serialize the Spark transformers for mleap, we only 
> serialize the params that are relevant for scoring. We do have the option to 
> de-serialize the serialized transformers back into Spark for scoring again, 
> but in that case, we no longer have all the training params.
> Test to reproduce in PR: [https://github.com/apache/spark/pull/24509]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27621) Calling transform() method on a LinearRegressionModel throws NoSuchElementException

2019-05-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27621:


Assignee: (was: Apache Spark)

> Calling transform() method on a LinearRegressionModel throws 
> NoSuchElementException
> ---
>
> Key: SPARK-27621
> URL: https://issues.apache.org/jira/browse/SPARK-27621
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2
>Reporter: Anca Sarb
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When the transform(...) method is called on a LinearRegressionModel created 
> directly with the coefficients and intercept, the following exception is 
> encountered.
> {code:java}
> java.util.NoSuchElementException: Failed to find a default value for loss at 
> org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
>  at 
> org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
>  at scala.Option.getOrElse(Option.scala:121) at 
> org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779) at 
> org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42) at 
> org.apache.spark.ml.param.Params$class.$(params.scala:786) at 
> org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42) at 
> org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111)
>  at 
> org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637)
>  at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192) 
> at 
> org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
>  at 
> org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) at 
> org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311) at 
> org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at 
> org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305)
> {code}
> This is because validateAndTransformSchema() is called during both the training 
> and scoring phases, but the checks against training-related params like 
> loss should really be performed only during the training phase; please 
> correct me if I'm missing anything.
> This issue was first reported for mleap 
> ([combust/mleap#455|https://github.com/combust/mleap/issues/455]) because 
> basically when we serialize the Spark transformers for mleap, we only 
> serialize the params that are relevant for scoring. We do have the option to 
> de-serialize the serialized transformers back into Spark for scoring again, 
> but in that case, we no longer have all the training params.
> Test to reproduce in PR: [https://github.com/apache/spark/pull/24509]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27621) Calling transform() method on a LinearRegressionModel throws NoSuchElementException

2019-05-02 Thread Anca Sarb (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anca Sarb updated SPARK-27621:
--
Description: 
When the transform(...) method is called on a LinearRegressionModel created 
directly with the coefficients and intercept, the following exception is 
encountered.
{code:java}
java.util.NoSuchElementException: Failed to find a default value for loss at 
org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
 at 
org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
 at scala.Option.getOrElse(Option.scala:121) at 
org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779) at 
org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42) at 
org.apache.spark.ml.param.Params$class.$(params.scala:786) at 
org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42) at 
org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111)
 at 
org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637)
 at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192) at 
org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
 at 
org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
 at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) 
at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
 at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) at 
org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311) at 
org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at 
org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305)
{code}
This is because validateAndTransformSchema() is called during both the training and 
scoring phases, but the checks against training-related params like loss 
should really be performed only during the training phase; please correct 
me if I'm missing anything.

This issue was first reported for mleap 
([combust/mleap#455|https://github.com/combust/mleap/issues/455]) because 
basically when we serialize the Spark transformers for mleap, we only serialize 
the params that are relevant for scoring. We do have the option to de-serialize 
the serialized transformers back into Spark for scoring again, but in that 
case, we no longer have all the training params.

Test to reproduce in PR: [https://github.com/apache/spark/pull/24509]

 

  was:
When the transform(...) method is called on a LinearRegressionModel created 
directly with the coefficients and intercept, the following exception is 
encountered.
{code:java}
java.util.NoSuchElementException: Failed to find a default value for loss at 
org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
 at 
org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
 at scala.Option.getOrElse(Option.scala:121) at 
org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779) at 
org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42) at 
org.apache.spark.ml.param.Params$class.$(params.scala:786) at 
org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42) at 
org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111)
 at 
org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637)
 at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192) at 
org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
 at 
org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
 at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) 
at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
 at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) at 
org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311) at 
org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at 
org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305)
{code}
This is because validateAndTransformSchema() is called during both the training and 
scoring phases, but the checks against training-related params like loss 
should really be performed only during the training phase; please correct 
me if I'm missing anything :)

This issue was first reported for mleap 
([combust/mleap#455|https://github.com/combust/mleap/issues/455]) because 
basically when we serialize the Spark transformers for mleap, we only serialize 
the params that are relevant for scoring. We do have the option to de-serialize 
the serialized transformers back into Spark for scoring again, but in that 
case, we no longer have all the training params.

Test to reproduce in PR: [https://github.com/apache/spark/pull/24509]

[jira] [Commented] (SPARK-27621) Calling transform() method on a LinearRegressionModel throws NoSuchElementException

2019-05-02 Thread Anca Sarb (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831499#comment-16831499
 ] 

Anca Sarb commented on SPARK-27621:
---

I've created a PR with the fix here [https://github.com/apache/spark/pull/24509]

> Calling transform() method on a LinearRegressionModel throws 
> NoSuchElementException
> ---
>
> Key: SPARK-27621
> URL: https://issues.apache.org/jira/browse/SPARK-27621
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2
>Reporter: Anca Sarb
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When the transform(...) method is called on a LinearRegressionModel created 
> directly with the coefficients and intercept, the following exception is 
> encountered.
> {code:java}
> java.util.NoSuchElementException: Failed to find a default value for loss at 
> org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
>  at 
> org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
>  at scala.Option.getOrElse(Option.scala:121) at 
> org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779) at 
> org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42) at 
> org.apache.spark.ml.param.Params$class.$(params.scala:786) at 
> org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42) at 
> org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111)
>  at 
> org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637)
>  at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192) 
> at 
> org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
>  at 
> org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) at 
> org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311) at 
> org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at 
> org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305)
> {code}
> This is because validateAndTransformSchema() is called during both the training 
> and scoring phases, but the checks against training-related params like 
> loss should really be performed only during the training phase; please 
> correct me if I'm missing anything :)
> This issue was first reported for mleap 
> ([combust/mleap#455|https://github.com/combust/mleap/issues/455]) because 
> basically when we serialize the Spark transformers for mleap, we only 
> serialize the params that are relevant for scoring. We do have the option to 
> de-serialize the serialized transformers back into Spark for scoring again, 
> but in that case, we no longer have all the training params.
> Test to reproduce in PR: [https://github.com/apache/spark/pull/24509]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27621) Calling transform() method on a LinearRegressionModel throws NoSuchElementException

2019-05-02 Thread Anca Sarb (JIRA)
Anca Sarb created SPARK-27621:
-

 Summary: Calling transform() method on a LinearRegressionModel 
throws NoSuchElementException
 Key: SPARK-27621
 URL: https://issues.apache.org/jira/browse/SPARK-27621
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.4.2, 2.4.1, 2.4.0, 2.3.3, 2.3.2, 2.3.1, 2.3.0, 2.3.4
Reporter: Anca Sarb


When the transform(...) method is called on a LinearRegressionModel created 
directly with the coefficients and intercept, the following exception is 
encountered.
{code:java}
java.util.NoSuchElementException: Failed to find a default value for loss at 
org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
 at 
org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
 at scala.Option.getOrElse(Option.scala:121) at 
org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779) at 
org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42) at 
org.apache.spark.ml.param.Params$class.$(params.scala:786) at 
org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42) at 
org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111)
 at 
org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637)
 at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192) at 
org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
 at 
org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
 at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) 
at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
 at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) at 
org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311) at 
org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at 
org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305)
{code}
This is because validateAndTransformSchema() is called during both the training and 
scoring phases, but the checks against training-related params like loss 
should really be performed only during the training phase; please correct 
me if I'm missing anything :)

This issue was first reported for mleap 
([combust/mleap#455|https://github.com/combust/mleap/issues/455]) because 
basically when we serialize the Spark transformers for mleap, we only serialize 
the params that are relevant for scoring. We do have the option to de-serialize 
the serialized transformers back into Spark for scoring again, but in that 
case, we no longer have all the training params.

Test to reproduce in PR: [https://github.com/apache/spark/pull/24509]

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org