[jira] [Resolved] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
[ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27612. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24519 [https://github.com/apache/spark/pull/24519] > Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays > of None > - > > Key: SPARK-27612 > URL: https://issues.apache.org/jira/browse/SPARK-27612 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Assignee: Hyukjin Kwon >Priority: Blocker > Labels: correctness > Fix For: 3.0.0 > > > This seems to only affect Python 3. > When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there > ends up being rows that are filled with None. > > {code:java} > In [1]: from pyspark.sql.types import ArrayType, IntegerType > > In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, > ArrayType(IntegerType(), True)) > In [3]: df.distinct().collect() > > Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])] > {code} > > From this example, it is consistently at elements 97, 98: > {code} > In [5]: df.collect()[-5:] > > Out[5]: > [Row(value=[1, 2, 3, 4]), > Row(value=[1, 2, 3, 4]), > Row(value=[None, None, None, None]), > Row(value=[None, None, None, None]), > Row(value=[1, 2, 3, 4])] > {code} > This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
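Editor's note: until the fix in PR 24519 is picked up, the corruption pattern reported above can be detected downstream with a plain-Python check over collected rows. The `rows` list below is a hypothetical stand-in for the output of `df.collect()` (this sketch does not start a Spark session):

```python
# Hypothetical stand-in for df.collect() from the report: 100 rows, with the
# all-None corruption appearing at indices 97 and 98 as described above.
rows = [[1, 2, 3, 4]] * 97 + [[None, None, None, None]] * 2 + [[1, 2, 3, 4]]

# Flag rows whose array is entirely None -- the corruption pattern reported here.
corrupted = [i for i, row in enumerate(rows) if all(v is None for v in row)]

print(corrupted)  # [97, 98]
```

A check like this only detects the symptom; it does not distinguish corruption from data that legitimately contains all-None arrays.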
[jira] [Assigned] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
[ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-27612: Assignee: Hyukjin Kwon > Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays > of None > - > > Key: SPARK-27612 > URL: https://issues.apache.org/jira/browse/SPARK-27612 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Assignee: Hyukjin Kwon >Priority: Blocker > Labels: correctness > > This seems to only affect Python 3. > When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there > ends up being rows that are filled with None. > > {code:java} > In [1]: from pyspark.sql.types import ArrayType, IntegerType > > In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, > ArrayType(IntegerType(), True)) > In [3]: df.distinct().collect() > > Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])] > {code} > > From this example, it is consistently at elements 97, 98: > {code} > In [5]: df.collect()[-5:] > > Out[5]: > [Row(value=[1, 2, 3, 4]), > Row(value=[1, 2, 3, 4]), > Row(value=[None, None, None, None]), > Row(value=[None, None, None, None]), > Row(value=[1, 2, 3, 4])] > {code} > This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27586) Improve binary comparison: replace Scala's for-comprehension if statements with while loop
[ https://issues.apache.org/jira/browse/SPARK-27586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27586: -- Affects Version/s: (was: 2.4.2) 3.0.0 > Improve binary comparison: replace Scala's for-comprehension if statements > with while loop > -- > > Key: SPARK-27586 > URL: https://issues.apache.org/jira/browse/SPARK-27586 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 > Environment: benchmark env: > * Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz > * Linux 4.4.0-33.bm.1-amd64 > * java version "1.8.0_131" > * Scala 2.11.8 > * perf version 4.4.0 > Run: > 40,000,000 times comparison on 32 bytes-length binary > >Reporter: WoudyGao >Assignee: WoudyGao >Priority: Minor > Fix For: 3.0.0 > > > I found the cpu cost of TypeUtils.compareBinary is noticeable when handle > some big parquet files; > After some perf work, I found: > the " for-comprehension if statements" will execute ≈15X instructions than > while loop > > *'while-loop' version perf:* > > {{886.687949 task-clock (msec) # 1.257 CPUs > utilized}} > {{ 3,089 context-switches # 0.003 M/sec}} > {{ 265 cpu-migrations# 0.299 K/sec}} > {{12,227 page-faults # 0.014 M/sec}} > {{ 2,209,183,920 cycles# 2.492 GHz}} > {{ stalled-cycles-frontend}} > {{ stalled-cycles-backend}} > {{ 6,865,836,114 instructions # 3.11 insns per > cycle}} > {{ 1,568,910,228 branches # 1769.405 M/sec}} > {{ 9,172,613 branch-misses # 0.58% of all > branches}} > > {{ 0.705671157 seconds time elapsed}} > > *TypeUtils.compareBinary perf:* > {{ 16347.242313 task-clock (msec) # 1.233 CPUs > utilized}} > {{ 8,370 context-switches # 0.512 K/sec}} > {{ 481 cpu-migrations# 0.029 K/sec}} > {{ 536,671 page-faults # 0.033 M/sec}} > {{40,857,347,119 cycles# 2.499 GHz}} > {{ stalled-cycles-frontend}} > {{ stalled-cycles-backend}} > {{90,606,381,612 instructions # 2.22 insns per > cycle}} > {{18,107,867,151 branches # 1107.702 M/sec}} > {{12,880,296 branch-misses # 0.07% of all > 
branches}} > > {{ 13.257617118 seconds time elapsed}}
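Editor's note: the SPARK-27586 rewrite replaces a Scala for-comprehension with an explicit index loop. The lexicographic byte comparison being benchmarked can be sketched in Python (a sketch of the algorithm only, not Spark's actual `TypeUtils.compareBinary`):

```python
def compare_binary(x: bytes, y: bytes) -> int:
    """Lexicographic comparison: negative if x < y, zero if equal, positive if x > y."""
    n = min(len(x), len(y))
    i = 0
    # Explicit index loop, mirroring the while-loop rewrite: no per-element
    # closure or iterator machinery, just an index increment and a compare.
    while i < n:
        diff = x[i] - y[i]
        if diff != 0:
            return diff
        i += 1
    # All shared bytes are equal: the shorter input sorts first.
    return len(x) - len(y)
```

The benchmark numbers quoted in the issue are about the Scala/JVM cost of the for-comprehension's generated closures, but the loop structure is the same idea.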
[jira] [Resolved] (SPARK-27586) Improve binary comparison: replace Scala's for-comprehension if statements with while loop
[ https://issues.apache.org/jira/browse/SPARK-27586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27586. --- Resolution: Fixed Assignee: WoudyGao Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24494 > Improve binary comparison: replace Scala's for-comprehension if statements > with while loop > -- > > Key: SPARK-27586 > URL: https://issues.apache.org/jira/browse/SPARK-27586 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.2 > Environment: benchmark env: > * Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz > * Linux 4.4.0-33.bm.1-amd64 > * java version "1.8.0_131" > * Scala 2.11.8 > * perf version 4.4.0 > Run: > 40,000,000 times comparison on 32 bytes-length binary > >Reporter: WoudyGao >Assignee: WoudyGao >Priority: Minor > Fix For: 3.0.0 > > > I found the cpu cost of TypeUtils.compareBinary is noticeable when handle > some big parquet files; > After some perf work, I found: > the " for-comprehension if statements" will execute ≈15X instructions than > while loop > > *'while-loop' version perf:* > > {{886.687949 task-clock (msec) # 1.257 CPUs > utilized}} > {{ 3,089 context-switches # 0.003 M/sec}} > {{ 265 cpu-migrations# 0.299 K/sec}} > {{12,227 page-faults # 0.014 M/sec}} > {{ 2,209,183,920 cycles# 2.492 GHz}} > {{ stalled-cycles-frontend}} > {{ stalled-cycles-backend}} > {{ 6,865,836,114 instructions # 3.11 insns per > cycle}} > {{ 1,568,910,228 branches # 1769.405 M/sec}} > {{ 9,172,613 branch-misses # 0.58% of all > branches}} > > {{ 0.705671157 seconds time elapsed}} > > *TypeUtils.compareBinary perf:* > {{ 16347.242313 task-clock (msec) # 1.233 CPUs > utilized}} > {{ 8,370 context-switches # 0.512 K/sec}} > {{ 481 cpu-migrations# 0.029 K/sec}} > {{ 536,671 page-faults # 0.033 M/sec}} > {{40,857,347,119 cycles# 2.499 GHz}} > {{ stalled-cycles-frontend}} > {{ stalled-cycles-backend}} > {{90,606,381,612 instructions # 2.22 insns per > cycle}} > 
{{18,107,867,151 branches # 1107.702 M/sec}} > {{12,880,296 branch-misses # 0.07% of all > branches}} > > {{ 13.257617118 seconds time elapsed}}
[jira] [Assigned] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
[ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27612: Assignee: Apache Spark > Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays > of None > - > > Key: SPARK-27612 > URL: https://issues.apache.org/jira/browse/SPARK-27612 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Blocker > Labels: correctness > > This seems to only affect Python 3. > When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there > ends up being rows that are filled with None. > > {code:java} > In [1]: from pyspark.sql.types import ArrayType, IntegerType > > In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, > ArrayType(IntegerType(), True)) > In [3]: df.distinct().collect() > > Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])] > {code} > > From this example, it is consistently at elements 97, 98: > {code} > In [5]: df.collect()[-5:] > > Out[5]: > [Row(value=[1, 2, 3, 4]), > Row(value=[1, 2, 3, 4]), > Row(value=[None, None, None, None]), > Row(value=[None, None, None, None]), > Row(value=[1, 2, 3, 4])] > {code} > This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27467) Upgrade Maven to 3.6.1
[ https://issues.apache.org/jira/browse/SPARK-27467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27467. --- Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24481 > Upgrade Maven to 3.6.1 > -- > > Key: SPARK-27467 > URL: https://issues.apache.org/jira/browse/SPARK-27467 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > This issue aim to upgrade Maven to 3.6.1 to bring JDK9+ patches like > MNG-6506. For the full release note, please see the following. > https://maven.apache.org/docs/3.6.1/release-notes.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
[ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27612: Assignee: (was: Apache Spark) > Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays > of None > - > > Key: SPARK-27612 > URL: https://issues.apache.org/jira/browse/SPARK-27612 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Blocker > Labels: correctness > > This seems to only affect Python 3. > When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there > ends up being rows that are filled with None. > > {code:java} > In [1]: from pyspark.sql.types import ArrayType, IntegerType > > In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, > ArrayType(IntegerType(), True)) > In [3]: df.distinct().collect() > > Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])] > {code} > > From this example, it is consistently at elements 97, 98: > {code} > In [5]: df.collect()[-5:] > > Out[5]: > [Row(value=[1, 2, 3, 4]), > Row(value=[1, 2, 3, 4]), > Row(value=[None, None, None, None]), > Row(value=[None, None, None, None]), > Row(value=[1, 2, 3, 4])] > {code} > This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
[ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27612: - Priority: Blocker (was: Critical) > Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays > of None > - > > Key: SPARK-27612 > URL: https://issues.apache.org/jira/browse/SPARK-27612 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Blocker > Labels: correctness > > This seems to only affect Python 3. > When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there > ends up being rows that are filled with None. > > {code:java} > In [1]: from pyspark.sql.types import ArrayType, IntegerType > > In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, > ArrayType(IntegerType(), True)) > In [3]: df.distinct().collect() > > Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])] > {code} > > From this example, it is consistently at elements 97, 98: > {code} > In [5]: df.collect()[-5:] > > Out[5]: > [Row(value=[1, 2, 3, 4]), > Row(value=[1, 2, 3, 4]), > Row(value=[None, None, None, None]), > Row(value=[None, None, None, None]), > Row(value=[1, 2, 3, 4])] > {code} > This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27623) Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
[ https://issues.apache.org/jira/browse/SPARK-27623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832140#comment-16832140 ] Yuming Wang commented on SPARK-27623: - [~abarbulescu] We have a built-in AVRO data source implementation since SPARK-24768(2.4.0). Could you try to remove {{--jars Spark-Avro 2.4.0}} and load avro by {{spark.read.format("avro").load("/path/to/avro")}} > Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated > --- > > Key: SPARK-27623 > URL: https://issues.apache.org/jira/browse/SPARK-27623 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.2 >Reporter: Alexandru Barbulescu >Priority: Critical > > After updating to spark 2.4.2 when using the > {code:java} > spark.read.format().options().load() > {code} > > chain of methods, regardless of what parameter is passed to "format" we get > the following error related to avro: > > {code:java} > - .options(**load_options) > - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line > 172, in load > - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line > 1257, in __call__ > - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in > deco > - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line > 328, in get_return_value > - py4j.protocol.Py4JJavaError: An error occurred while calling o69.load. 
> - : java.util.ServiceConfigurationError: > org.apache.spark.sql.sources.DataSourceRegister: Provider > org.apache.spark.sql.avro.AvroFileFormat could not be instantiated > - at java.util.ServiceLoader.fail(ServiceLoader.java:232) > - at java.util.ServiceLoader.access$100(ServiceLoader.java:185) > - at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384) > - at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) > - at java.util.ServiceLoader$1.next(ServiceLoader.java:480) > - at > scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44) > - at scala.collection.Iterator.foreach(Iterator.scala:941) > - at scala.collection.Iterator.foreach$(Iterator.scala:941) > - at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > - at scala.collection.IterableLike.foreach(IterableLike.scala:74) > - at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > - at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > - at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:250) > - at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:248) > - at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) > - at scala.collection.TraversableLike.filter(TraversableLike.scala:262) > - at scala.collection.TraversableLike.filter$(TraversableLike.scala:262) > - at scala.collection.AbstractTraversable.filter(Traversable.scala:108) > - at > org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630) > - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194) > - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167) > - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > - at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > - at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > - at 
java.lang.reflect.Method.invoke(Method.java:498) > - at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > - at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > - at py4j.Gateway.invoke(Gateway.java:282) > - at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > - at py4j.commands.CallCommand.execute(CallCommand.java:79) > - at py4j.GatewayConnection.run(GatewayConnection.java:238) > - at java.lang.Thread.run(Thread.java:748) > - Caused by: java.lang.NoClassDefFoundError: > org/apache/spark/sql/execution/datasources/FileFormat$class > - at org.apache.spark.sql.avro.AvroFileFormat.(AvroFileFormat.scala:44) > - at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > - at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > - at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > - at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > - at java.lang.Class.newInstance(Class.java:442) > - at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380) > - ... 29 more > - Caused by: java.lang.ClassNotFoundException: > org.apache.spark.sql.execution.datasources.FileFormat$class > - at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > - at
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16832136#comment-16832136 ] Bryan Cutler commented on SPARK-27396: -- The revisions sound good to me, it has a little more focus now and I agree it's better to handle exporting columnar data on the other SPIP. I have a question on the nice-to-have point of exposing data in Arrow format. You are meaning that the Arrow Java APIs will not be publicly exposed, only access to raw data I believe, so the user can process the data with Arrow without further conversion? If this is not done, then the user would have to do some minor conversions from Spark internal format to Arrow? Also, how are the batch sizes set? Is it basically one batch per partition or can it be configured to break it up to smaller batches? > SPIP: Public APIs for extended Columnar Processing Support > -- > > Key: SPARK-27396 > URL: https://issues.apache.org/jira/browse/SPARK-27396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Priority: Major > > *SPIP: Columnar Processing Without Arrow Formatting Guarantees.* > > *Q1.* What are you trying to do? Articulate your objectives using absolutely > no jargon. > The Dataset/DataFrame API in Spark currently only exposes to users one row at > a time when processing data. The goals of this are to > # Add to the current sql extensions mechanism so advanced users can have > access to the physical SparkPlan and manipulate it to provide columnar > processing for existing operators, including shuffle. This will allow them > to implement their own cost based optimizers to decide when processing should > be columnar and when it should not. > # Make any transitions between the columnar memory layout and a row based > layout transparent to the users so operations that are not columnar see the > data as rows, and operations that are columnar see the data as columns. 
> > Not Requirements, but things that would be nice to have. > # Transition the existing in memory columnar layouts to be compatible with > Apache Arrow. This would make the transformations to Apache Arrow format a > no-op. The existing formats are already very close to those layouts in many > cases. This would not be using the Apache Arrow java library, but instead > being compatible with the memory > [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a > subset of that layout. > > *Q2.* What problem is this proposal NOT designed to solve? > The goal of this is not for ML/AI but to provide APIs for accelerated > computing in Spark primarily targeting SQL/ETL like workloads. ML/AI already > have several mechanisms to get data into/out of them. These can be improved > but will be covered in a separate SPIP. > This is not trying to implement any of the processing itself in a columnar > way, with the exception of examples for documentation. > This does not cover exposing the underlying format of the data. The only way > to get at the data in a ColumnVector is through the public APIs. Exposing > the underlying format to improve efficiency will be covered in a separate > SPIP. > This is not trying to implement new ways of transferring data to external > ML/AI applications. That is covered by separate SPIPs already. > This is not trying to add in generic code generation for columnar processing. > Currently code generation for columnar processing is only supported when > translating columns to rows. We will continue to support this, but will not > extend it as a general solution. That will be covered in a separate SPIP if > we find it is helpful. For now columnar processing will be interpreted. > This is not trying to expose a way to get columnar data into Spark through > DataSource V2 or any other similar API. That would be covered by a separate > SPIP if we find it is needed. 
> > *Q3.* How is it done today, and what are the limits of current practice? > The current columnar support is limited to 3 areas. > # Internal implementations of FileFormats, optionally can return a > ColumnarBatch instead of rows. The code generation phase knows how to take > that columnar data and iterate through it as rows for stages that wants rows, > which currently is almost everything. The limitations here are mostly > implementation specific. The current standard is to abuse Scala’s type > erasure to return ColumnarBatches as the elements of an RDD[InternalRow]. The > code generation can handle this because it is generating java code, so it > bypasses scala’s type checking and just casts the InternalRow to the desired > ColumnarBatch. This makes it difficult for others to implement the same >
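Editor's note: the row/columnar transition described in the SPIP above (Q1, point 2) is essentially a transpose between a row-oriented and a column-oriented layout. That idea can be sketched in plain Python (illustrative only; Spark's ColumnarBatch/InternalRow machinery is far more involved):

```python
# Row-oriented layout: one tuple per record, as a row-based operator sees data.
rows = [(1, "a"), (2, "b"), (3, "c")]

# Rows -> columns: each column holds all values for one field, as a
# columnar operator would see the same data.
columns = [list(col) for col in zip(*rows)]
print(columns)  # [[1, 2, 3], ['a', 'b', 'c']]

# Columns -> rows: the inverse transition back to a row view.
back = [tuple(rec) for rec in zip(*columns)]
assert back == rows
```

Making both directions cheap and transparent to the operator on either side is the crux of the proposal; real columnar batches also carry null masks and fixed memory layouts that this toy transpose ignores.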
[jira] [Resolved] (SPARK-27620) Update jetty to 9.4.18.v20190429
[ https://issues.apache.org/jira/browse/SPARK-27620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27620. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24513 [https://github.com/apache/spark/pull/24513] > Update jetty to 9.4.18.v20190429 > > > Key: SPARK-27620 > URL: https://issues.apache.org/jira/browse/SPARK-27620 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: yuming.wang >Priority: Major > Fix For: 3.0.0 > > > Update jetty to 9.4.18.v20190429 because of > [CVE-2019-10247|https://nvd.nist.gov/vuln/detail/CVE-2019-10247]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27620) Update jetty to 9.4.18.v20190429
[ https://issues.apache.org/jira/browse/SPARK-27620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-27620: Assignee: yuming.wang > Update jetty to 9.4.18.v20190429 > > > Key: SPARK-27620 > URL: https://issues.apache.org/jira/browse/SPARK-27620 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: yuming.wang >Priority: Major > > Update jetty to 9.4.18.v20190429 because of > [CVE-2019-10247|https://nvd.nist.gov/vuln/detail/CVE-2019-10247]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27627) Make option "pathGlobFilter" as a general option for all file sources
[ https://issues.apache.org/jira/browse/SPARK-27627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27627: Assignee: Apache Spark > Make option "pathGlobFilter" as a general option for all file sources > - > > Key: SPARK-27627 > URL: https://issues.apache.org/jira/browse/SPARK-27627 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Background: > The data source option "pathGlobFilter" is introduced for Binary file format: > https://github.com/apache/spark/pull/24354 , which can be used for filtering > file names, e.g. reading "*.png" files only while there is "*.json" files in > the same directory. > Proposal: > Make the option "pathGlobFilter" as a general option for all file sources. > The path filtering should happen in the path globbing on Driver. > Motivation: > Filtering the file path names in file scan tasks on executors is kind of > ugly. > Impact: > 1. The splitting of file partitions will be more balanced. > 2. The metrics of file scan will be more accurate. > 3. Users can use the option for reading other file sources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27627) Make option "pathGlobFilter" as a general option for all file sources
[ https://issues.apache.org/jira/browse/SPARK-27627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27627: Assignee: (was: Apache Spark) > Make option "pathGlobFilter" as a general option for all file sources > - > > Key: SPARK-27627 > URL: https://issues.apache.org/jira/browse/SPARK-27627 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > Background: > The data source option "pathGlobFilter" is introduced for Binary file format: > https://github.com/apache/spark/pull/24354 , which can be used for filtering > file names, e.g. reading "*.png" files only while there is "*.json" files in > the same directory. > Proposal: > Make the option "pathGlobFilter" as a general option for all file sources. > The path filtering should happen in the path globbing on Driver. > Motivation: > Filtering the file path names in file scan tasks on executors is kind of > ugly. > Impact: > 1. The splitting of file partitions will be more balanced. > 2. The metrics of file scan will be more accurate. > 3. Users can use the option for reading other file sources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27627) Make option "pathGlobFilter" as a general option for all file sources
Gengliang Wang created SPARK-27627: -- Summary: Make option "pathGlobFilter" as a general option for all file sources Key: SPARK-27627 URL: https://issues.apache.org/jira/browse/SPARK-27627 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang Background: The data source option "pathGlobFilter" is introduced for Binary file format: https://github.com/apache/spark/pull/24354 , which can be used for filtering file names, e.g. reading "*.png" files only while there is "*.json" files in the same directory. Proposal: Make the option "pathGlobFilter" as a general option for all file sources. The path filtering should happen in the path globbing on Driver. Motivation: Filtering the file path names in file scan tasks on executors is kind of ugly. Impact: 1. The splitting of file partitions will be more balanced. 2. The metrics of file scan will be more accurate. 3. Users can use the option for reading other file sources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
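Editor's note: driver-side path filtering as proposed amounts to matching candidate file names against a glob pattern before any scan tasks are created. The intended semantics can be sketched with the standard-library `fnmatch` module (an illustration of the filtering rule, not Spark's implementation; the file listing is hypothetical, mirroring the *.png / *.json example in the description):

```python
import fnmatch

# Hypothetical candidate files discovered during path globbing on the driver.
paths = ["data/a.png", "data/b.json", "data/c.png", "data/notes.txt"]

# Keep only paths whose file name matches the pathGlobFilter-style pattern;
# filtering here, before task planning, is what makes partition splits balanced
# and scan metrics accurate.
pattern = "*.png"
selected = [p for p in paths if fnmatch.fnmatch(p.rsplit("/", 1)[-1], pattern)]

print(selected)  # ['data/a.png', 'data/c.png']
```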
[jira] [Assigned] (SPARK-27626) Fix `docker-image-tool.sh` to be robust in non-bash shell env
[ https://issues.apache.org/jira/browse/SPARK-27626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27626: Assignee: Apache Spark > Fix `docker-image-tool.sh` to be robust in non-bash shell env > - > > Key: SPARK-27626 > URL: https://issues.apache.org/jira/browse/SPARK-27626 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27626) Fix `docker-image-tool.sh` to be robust in non-bash shell env
[ https://issues.apache.org/jira/browse/SPARK-27626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27626: Assignee: (was: Apache Spark) > Fix `docker-image-tool.sh` to be robust in non-bash shell env > - > > Key: SPARK-27626 > URL: https://issues.apache.org/jira/browse/SPARK-27626 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27626) Fix `docker-image-tool.sh` to be robust in non-bash shell env
Dongjoon Hyun created SPARK-27626: - Summary: Fix `docker-image-tool.sh` to be robust in non-bash shell env Key: SPARK-27626 URL: https://issues.apache.org/jira/browse/SPARK-27626 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27625) ScalaReflection.serializerFor fails for annotated types
[ https://issues.apache.org/jira/browse/SPARK-27625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Grandjean updated SPARK-27625: -- Description: ScalaRelfection.serializerFor fails for annotated type. Example: {code:java} case class Foo( field1: String, field2: Option[String] @Bar ) val rdd: RDD[Foo] = ... val ds = rdd.toDS // fails at runtime{code} The stack trace: {code:java} // code placeholder User class threw exception: scala.MatchError: scala.Option[String] @Bar (of class scala.reflect.internal.Types$AnnotatedType) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:483) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445) at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445) at ...{code} I believe that it would be safe to ignore the annotation. was: ScalaRelfection.serializerFor fails for annotated type. Example: {code:java} case class Foo( field1: String, field2: Option[String] @Bar ) val rdd: RDD[Foo] = ... 
val ds = rdd.toDS // fails at runtime{code} The stack trace: {code:java} // code placeholder User class threw exception: scala.MatchError: scala.Option[String] @Bar (of class scala.reflect.internal.Types$AnnotatedType) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:483) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445) at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445) at ...{code} > ScalaReflection.serializerFor fails for annotated types > --- > > Key: SPARK-27625 > URL: https://issues.apache.org/jira/browse/SPARK-27625 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Patrick Grandjean >Priority: Major > > ScalaReflection.serializerFor fails for annotated types. Example: > {code:java} > case class Foo( > field1: String, > field2: Option[String] @Bar > ) > val rdd: RDD[Foo] = ... 
> val ds = rdd.toDS // fails at runtime{code} > The stack trace: > {code:java} > // code placeholder > User class threw exception: scala.MatchError: scala.Option[String] @Bar (of > class scala.reflect.internal.Types$AnnotatedType) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:483) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445) > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) > at > org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445) > at ...{code} > I believe that it would be safe to ignore the annotation.
[jira] [Updated] (SPARK-27625) ScalaReflection.serializerFor fails for annotated types
[ https://issues.apache.org/jira/browse/SPARK-27625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Grandjean updated SPARK-27625: -- Description: ScalaReflection.serializerFor fails for annotated types. Example: {code:java} case class Foo( field1: String, field2: Option[String] @Bar ) val rdd: RDD[Foo] = ... val ds = rdd.toDS // fails at runtime{code} The stack trace: {code:java} // code placeholder User class threw exception: scala.MatchError: scala.Option[String] @Bar (of class scala.reflect.internal.Types$AnnotatedType) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:483) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445) at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445) at ...{code} was: ScalaReflection.serializerFor fails for annotated types. Example: {code:java} case class Foo( field1: String, field2: Option[String] @Bar ) val rdd: RDD[Foo] = ... 
val ds = rdd.toDS // fails at runtime{code} The stack trace: User class threw exception: scala.MatchError: scala.Option[String] @Bar (of class scala.reflect.internal.Types$AnnotatedType) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:483) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445) at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445) at > ScalaReflection.serializerFor fails for annotated types > --- > > Key: SPARK-27625 > URL: https://issues.apache.org/jira/browse/SPARK-27625 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Patrick Grandjean >Priority: Major > > ScalaReflection.serializerFor fails for annotated types. Example: > {code:java} > case class Foo( > field1: String, > field2: Option[String] @Bar > ) > val rdd: RDD[Foo] = ... 
> val ds = rdd.toDS // fails at runtime{code} > The stack trace: > {code:java} > // code placeholder > User class threw exception: scala.MatchError: scala.Option[String] @Bar (of > class scala.reflect.internal.Types$AnnotatedType) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:483) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445) > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) > at > org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445) > at ...{code}
[jira] [Created] (SPARK-27625) ScalaReflection.serializerFor fails for annotated types
Patrick Grandjean created SPARK-27625: - Summary: ScalaReflection.serializerFor fails for annotated types Key: SPARK-27625 URL: https://issues.apache.org/jira/browse/SPARK-27625 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.2 Reporter: Patrick Grandjean ScalaReflection.serializerFor fails for annotated types. Example: {code:java} case class Foo( field1: String, field2: Option[String] @Bar ) val rdd: RDD[Foo] = ... val ds = rdd.toDS // fails at runtime{code} The stack trace: User class threw exception: scala.MatchError: scala.Option[String] @Bar (of class scala.reflect.internal.Types$AnnotatedType) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:483) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445) at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445) at
[jira] [Assigned] (SPARK-27601) Upgrade stream-lib to 2.9.6
[ https://issues.apache.org/jira/browse/SPARK-27601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27601: - Assignee: Yuming Wang Priority: Minor (was: Major) BTW I wouldn't call these "Major", but it doesn't matter much > Upgrade stream-lib to 2.9.6 > --- > > Key: SPARK-27601 > URL: https://issues.apache.org/jira/browse/SPARK-27601 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Minor > Fix For: 3.0.0 > > > 1. Improve HyperLogLogPlus.merge and HyperLogLogPlus.mergeEstimators by using > native arrays instead of ArrayLists to avoid boxing and unboxing of integers. > https://github.com/addthis/stream-lib/pull/98 > 2. Improve HyperLogLogPlus.sortEncodedSet by using Arrays.sort on > appropriately transformed encoded values, in this way boxing and unboxing of > integers is avoided. https://github.com/addthis/stream-lib/pull/97
[jira] [Resolved] (SPARK-27601) Upgrade stream-lib to 2.9.6
[ https://issues.apache.org/jira/browse/SPARK-27601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27601. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24492 [https://github.com/apache/spark/pull/24492] > Upgrade stream-lib to 2.9.6 > --- > > Key: SPARK-27601 > URL: https://issues.apache.org/jira/browse/SPARK-27601 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > 1. Improve HyperLogLogPlus.merge and HyperLogLogPlus.mergeEstimators by using > native arrays instead of ArrayLists to avoid boxing and unboxing of integers. > https://github.com/addthis/stream-lib/pull/98 > 2. Improve HyperLogLogPlus.sortEncodedSet by using Arrays.sort on > appropriately transformed encoded values, in this way boxing and unboxing of > integers is avoided. https://github.com/addthis/stream-lib/pull/97
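Both linked stream-lib PRs apply the same technique: operate on primitive int[] arrays instead of ArrayList<Integer> so the JVM never boxes or unboxes during merges and sorts. A minimal, self-contained Java sketch of that technique follows; the class and method names here are illustrative, not stream-lib's actual internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortWithoutBoxing {

    // Boxed route: every int is wrapped in an Integer on insert, and every
    // comparison inside Collections.sort unboxes the wrappers again.
    public static int[] sortBoxed(int[] values) {
        List<Integer> list = new ArrayList<>(values.length);
        for (int v : values) list.add(v);        // autoboxes each int
        Collections.sort(list);                  // compares Integer objects
        int[] out = new int[values.length];
        for (int i = 0; i < out.length; i++) out[i] = list.get(i); // unboxes
        return out;
    }

    // Primitive route, the direction taken by the stream-lib PRs:
    // Arrays.sort works directly on int[], with no per-element allocation.
    public static int[] sortPrimitive(int[] values) {
        int[] out = values.clone();
        Arrays.sort(out);
        return out;
    }

    public static void main(String[] args) {
        int[] data = {5, 3, 1, 4, 2};
        System.out.println(Arrays.toString(sortPrimitive(data))); // prints [1, 2, 3, 4, 5]
    }
}
```

Both routes produce the same result; the primitive route simply skips the Integer allocations, which is the entire point of the upgrade.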
[jira] [Commented] (SPARK-27598) DStreams checkpointing does not work with the Spark Shell
[ https://issues.apache.org/jira/browse/SPARK-27598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831861#comment-16831861 ] Gabor Somogyi commented on SPARK-27598: --- Shell works differently somehow. I've seen problems with avro roundtrip conversion as well. > DStreams checkpointing does not work with the Spark Shell > - > > Key: SPARK-27598 > URL: https://issues.apache.org/jira/browse/SPARK-27598 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > When I restarted a stream with checkpointing enabled I got this: > {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from > file > [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk] > java.io.IOException: java.lang.ClassCastException: cannot assign instance of > java.lang.invoke.SerializedLambda to field > org.apache.spark.streaming.dstream.FileInputDStream.filter of type > scala.Function1 in instance of > org.apache.spark.streaming.dstream.FileInputDStream > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322) > at > org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > {quote} > It seems that the closure is stored as a SerializedLambda and cannot be > assigned back to a scala.Function1. > Details of how to reproduce it here: > [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6] > Maybe this is spark-shell specific and is not expected to work anyway, as I > don't see this as an issue with a normal jar. > Note that with Spark 2.3.3 this still does not work, though with a different > error.
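For context on the ClassCastException above: a Java lambda is serialized as a java.lang.invoke.SerializedLambda and is normally resolved back into a live lambda on deserialization via a synthetic $deserializeLambda$ method in the capturing class. The shell-specific failure happens when that resolution step cannot run, so the raw SerializedLambda ends up assigned to the scala.Function1 field. A minimal sketch of the normal, working round trip outside a REPL; the class and interface names are illustrative, not Spark's:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

public class LambdaRoundTrip {

    // A lambda is serializable only when its target type is Serializable.
    public interface SerFunction<T, R> extends Function<T, R>, Serializable {}

    public static byte[] serialize(Object o) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o); // the lambda is written as a SerializedLambda
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    @SuppressWarnings("unchecked")
    public static <T> T deserialize(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            // readObject resolves the SerializedLambda back into a live lambda
            // via the capturing class's synthetic $deserializeLambda$ method.
            return (T) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SerFunction<Integer, Integer> plusOne = x -> x + 1;
        SerFunction<Integer, Integer> restored = deserialize(serialize(plusOne));
        // Outside a REPL the round trip succeeds and the lambda is callable again.
        System.out.println(restored.apply(41)); // prints 42
    }
}
```

In the spark-shell case, the lambda's capturing class is a REPL-generated wrapper, which is presumably why the resolution step misbehaves during checkpoint recovery.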
[jira] [Assigned] (SPARK-27624) Fix CalendarInterval to show an empty interval correctly
[ https://issues.apache.org/jira/browse/SPARK-27624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27624: Assignee: Apache Spark > Fix CalendarInterval to show an empty interval correctly > > > Key: SPARK-27624 > URL: https://issues.apache.org/jira/browse/SPARK-27624 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.2, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > *BEFORE* > {code} > scala> spark.readStream.schema("ts > timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain > == Physical Plan == > EventTimeWatermark ts#0: timestamp, interval 1 microseconds > +- StreamingRelation FileSource[/tmp/t], [ts#0] > scala> spark.readStream.schema("ts > timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain > == Physical Plan == > EventTimeWatermark ts#3: timestamp, interval > +- StreamingRelation FileSource[/tmp/t], [ts#3] > {code} > *AFTER* > {code} > scala> spark.readStream.schema("ts > timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain > == Physical Plan == > EventTimeWatermark ts#0: timestamp, interval 1 microseconds > +- StreamingRelation FileSource[/tmp/t], [ts#0] > scala> spark.readStream.schema("ts > timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain > == Physical Plan == > EventTimeWatermark ts#3: timestamp, interval 0 microseconds > +- StreamingRelation FileSource[/tmp/t], [ts#3] > {code}
[jira] [Assigned] (SPARK-27624) Fix CalendarInterval to show an empty interval correctly
[ https://issues.apache.org/jira/browse/SPARK-27624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27624: Assignee: (was: Apache Spark) > Fix CalendarInterval to show an empty interval correctly > > > Key: SPARK-27624 > URL: https://issues.apache.org/jira/browse/SPARK-27624 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.2, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > *BEFORE* > {code} > scala> spark.readStream.schema("ts > timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain > == Physical Plan == > EventTimeWatermark ts#0: timestamp, interval 1 microseconds > +- StreamingRelation FileSource[/tmp/t], [ts#0] > scala> spark.readStream.schema("ts > timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain > == Physical Plan == > EventTimeWatermark ts#3: timestamp, interval > +- StreamingRelation FileSource[/tmp/t], [ts#3] > {code} > *AFTER* > {code} > scala> spark.readStream.schema("ts > timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain > == Physical Plan == > EventTimeWatermark ts#0: timestamp, interval 1 microseconds > +- StreamingRelation FileSource[/tmp/t], [ts#0] > scala> spark.readStream.schema("ts > timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain > == Physical Plan == > EventTimeWatermark ts#3: timestamp, interval 0 microseconds > +- StreamingRelation FileSource[/tmp/t], [ts#3] > {code}
[jira] [Created] (SPARK-27624) Fix CalendarInterval to show an empty interval correctly
Dongjoon Hyun created SPARK-27624: - Summary: Fix CalendarInterval to show an empty interval correctly Key: SPARK-27624 URL: https://issues.apache.org/jira/browse/SPARK-27624 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.2, 2.3.3, 3.0.0 Reporter: Dongjoon Hyun *BEFORE* {code} scala> spark.readStream.schema("ts timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain == Physical Plan == EventTimeWatermark ts#0: timestamp, interval 1 microseconds +- StreamingRelation FileSource[/tmp/t], [ts#0] scala> spark.readStream.schema("ts timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain == Physical Plan == EventTimeWatermark ts#3: timestamp, interval +- StreamingRelation FileSource[/tmp/t], [ts#3] {code} *AFTER* {code} scala> spark.readStream.schema("ts timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain == Physical Plan == EventTimeWatermark ts#0: timestamp, interval 1 microseconds +- StreamingRelation FileSource[/tmp/t], [ts#0] scala> spark.readStream.schema("ts timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain == Physical Plan == EventTimeWatermark ts#3: timestamp, interval 0 microseconds +- StreamingRelation FileSource[/tmp/t], [ts#3] {code}
[jira] [Assigned] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files
[ https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27194: Assignee: Apache Spark > Job failures when task attempts do not clean up spark-staging parquet files > --- > > Key: SPARK-27194 > URL: https://issues.apache.org/jira/browse/SPARK-27194 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.1, 2.3.2, 2.3.3 >Reporter: Reza Safi >Assignee: Apache Spark >Priority: Major > > When a container fails for some reason (for example when killed by yarn for > exceeding memory limits), the subsequent task attempts for the tasks that > were running on that container all fail with a FileAlreadyExistsException. > The original task attempt does not seem to successfully call abortTask (or at > least its "best effort" delete is unsuccessful) and clean up the parquet file > it was writing to, so when later task attempts try to write to the same > spark-staging directory using the same file name, the job fails. > Here is what transpires in the logs: > The container where task 200.0 is running is killed and the task is lost: > {code} > 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on > t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 > GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. > 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage > 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 > exited caused by one of the running tasks) Reason: Container killed by YARN > for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider > boosting spark.yarn.executor.memoryOverhead. 
> {code} > The task is re-attempted on a different executor and fails because the > part-00200-blah-blah.c000.snappy.parquet file from the first task attempt > already exists: > {code} > 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 > (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task > failed while writing rows. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet > for client a.b.c.d already exists > {code} > The job fails when the configured task attempts (spark.task.maxFailures) > have failed with the same error: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 > in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage > 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task > failed while writing rows. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > ... 
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet > for client i.p.a.d already exists > {code} > SPARK-26682 wasn't the root cause here, since there wasn't any stage > reattempt. > This issue seems to happen when > spark.sql.sources.partitionOverwriteMode=dynamic. > 
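As the description notes, the failure depends on dynamic partition overwrite (which routes writes through the shared .spark-staging-* directory where the leftover file collides) combined with task retries. An illustrative spark-defaults.conf fragment showing the settings in play; the values mirror the report (the stage failed after 20 attempts; Spark's default spark.task.maxFailures is 4) and reproduce the conditions rather than fix anything:

```properties
# Illustrative spark-defaults.conf fragment mirroring the report (not a fix).
# Dynamic partition overwrite writes through the shared .spark-staging-*
# directory where the leftover parquet file from the killed attempt collides:
spark.sql.sources.partitionOverwriteMode  dynamic
# The job aborts once this many attempts of one task hit the leftover file
# (the report shows 20 attempts; the Spark default is 4):
spark.task.maxFailures                    20
```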
[jira] [Assigned] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files
[ https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27194: Assignee: (was: Apache Spark) > Job failures when task attempts do not clean up spark-staging parquet files > --- > > Key: SPARK-27194 > URL: https://issues.apache.org/jira/browse/SPARK-27194 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.1, 2.3.2, 2.3.3 >Reporter: Reza Safi >Priority: Major > > When a container fails for some reason (for example when killed by yarn for > exceeding memory limits), the subsequent task attempts for the tasks that > were running on that container all fail with a FileAlreadyExistsException. > The original task attempt does not seem to successfully call abortTask (or at > least its "best effort" delete is unsuccessful) and clean up the parquet file > it was writing to, so when later task attempts try to write to the same > spark-staging directory using the same file name, the job fails. > Here is what transpires in the logs: > The container where task 200.0 is running is killed and the task is lost: > {code} > 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on > t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 > GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. > 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage > 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 > exited caused by one of the running tasks) Reason: Container killed by YARN > for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider > boosting spark.yarn.executor.memoryOverhead. 
> {code} > The task is re-attempted on a different executor and fails because the > part-00200-blah-blah.c000.snappy.parquet file from the first task attempt > already exists: > {code} > 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 > (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task > failed while writing rows. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet > for client a.b.c.d already exists > {code} > The job fails when the configured task attempts (spark.task.maxFailures) > have failed with the same error: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 > in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage > 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task > failed while writing rows. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > ... 
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet > for client i.p.a.d already exists > {code} > SPARK-26682 wasn't the root cause here, since there wasn't any stage > reattempt. > This issue seems to happen when > spark.sql.sources.partitionOverwriteMode=dynamic. > 
[jira] [Assigned] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files
[ https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-27194: -- Assignee: (was: Marcelo Vanzin) > Job failures when task attempts do not clean up spark-staging parquet files > --- > > Key: SPARK-27194 > URL: https://issues.apache.org/jira/browse/SPARK-27194 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.1, 2.3.2, 2.3.3 >Reporter: Reza Safi >Priority: Major > > When a container fails for some reason (for example when killed by yarn for > exceeding memory limits), the subsequent task attempts for the tasks that > were running on that container all fail with a FileAlreadyExistsException. > The original task attempt does not seem to successfully call abortTask (or at > least its "best effort" delete is unsuccessful) and clean up the parquet file > it was writing to, so when later task attempts try to write to the same > spark-staging directory using the same file name, the job fails. > Here is what transpires in the logs: > The container where task 200.0 is running is killed and the task is lost: > {code} > 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on > t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 > GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. > 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage > 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 > exited caused by one of the running tasks) Reason: Container killed by YARN > for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider > boosting spark.yarn.executor.memoryOverhead. 
> {code} > The task is re-attempted on a different executor and fails because the > part-00200-blah-blah.c000.snappy.parquet file from the first task attempt > already exists: > {code} > 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 > (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task > failed while writing rows. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet > for client a.b.c.d already exists > {code} > The job fails when the configured task attempts (spark.task.maxFailures) > have failed with the same error: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 > in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage > 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task > failed while writing rows. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > ... 
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet > for client i.p.a.d already exists > {code} > SPARK-26682 wasn't the root cause here, since there wasn't any stage > reattempt. > This issue seems to happen when > spark.sql.sources.partitionOverwriteMode=dynamic. > 
[jira] [Assigned] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files
[ https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-27194: -- Assignee: Marcelo Vanzin > Job failures when task attempts do not clean up spark-staging parquet files > --- > > Key: SPARK-27194 > URL: https://issues.apache.org/jira/browse/SPARK-27194 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.1, 2.3.2, 2.3.3 >Reporter: Reza Safi >Assignee: Marcelo Vanzin >Priority: Major > > When a container fails for some reason (for example when killed by yarn for > exceeding memory limits), the subsequent task attempts for the tasks that > were running on that container all fail with a FileAlreadyExistsException. > The original task attempt does not seem to successfully call abortTask (or at > least its "best effort" delete is unsuccessful) and clean up the parquet file > it was writing to, so when later task attempts try to write to the same > spark-staging directory using the same file name, the job fails. > Here is what transpires in the logs: > The container where task 200.0 is running is killed and the task is lost: > {code} > 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on > t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 > GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. > 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage > 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 > exited caused by one of the running tasks) Reason: Container killed by YARN > for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider > boosting spark.yarn.executor.memoryOverhead. 
> {code} > The task is re-attempted on a different executor and fails because the > part-00200-blah-blah.c000.snappy.parquet file from the first task attempt > already exists: > {code} > 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 > (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task > failed while writing rows. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet > for client a.b.c.d already exists > {code} > The job fails when the configured task attempts (spark.task.maxFailures) > have failed with the same error: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 > in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage > 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task > failed while writing rows. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > ... 
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet > for client i.p.a.d already exists > {code} > SPARK-26682 wasn't the root cause here, since there wasn't any stage > reattempt. > This issue seems to happen when > spark.sql.sources.partitionOverwriteMode=dynamic. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
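The failure mode described above comes down to a non-idempotent write: a retried task attempt collides with a leftover file that the lost attempt's abortTask never cleaned up. A minimal sketch of an idempotent retry pattern, outside Spark and not the actual committer code (the file layout here is hypothetical):

```python
import os
import tempfile

def write_part_file(path, rows):
    """Write rows to `path`, tolerating leftovers from a failed earlier attempt.

    A retried task may find a partial file left behind by an attempt whose
    cleanup never ran; writing to a temp file and atomically replacing the
    target makes the retry idempotent instead of raising "already exists".
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        for row in rows:
            f.write(row + "\n")
    os.replace(tmp, path)  # atomic; succeeds even if a stale file exists
```

With this pattern, a second attempt writing the same `part-00200` file simply replaces the stale output rather than failing the whole job after spark.task.maxFailures retries.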
[jira] [Assigned] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14083: Assignee: (was: Apache Spark) > Analyze JVM bytecode and turn closures into Catalyst expressions > > > Key: SPARK-14083 > URL: https://issues.apache.org/jira/browse/SPARK-14083 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Priority: Major > > One big advantage of the Dataset API is the type safety, at the cost of > performance due to heavy reliance on user-defined closures/lambdas. These > closures are typically slower than expressions because we have more > flexibility to optimize expressions (known data types, no virtual function > calls, etc). In many cases, it's actually not going to be very difficult to > look into the byte code of these closures and figure out what they are trying > to do. If we can understand them, then we can turn them directly into > Catalyst expressions for more optimized execution. > Some examples are: > {code} > df.map(_.name) // equivalent to expression col("name") > ds.groupBy(_.gender) // equivalent to expression col("gender") > df.filter(_.age > 18) // equivalent to expression GreaterThan(col("age"), > lit(18)) > df.map(_.id + 1) // equivalent to Add(col("id"), lit(1)) > {code} > The goal of this ticket is to design a small framework for byte code analysis > and use that to convert closures/lambdas into Catalyst expressions in order > to speed up Dataset execution. It is a little bit futuristic, but I believe > it is very doable. The framework should be easy to reason about (e.g. similar > to Catalyst). > Note that there is a big emphasis on "small" and "easy to reason about". A patch > should be rejected if it is too complicated or difficult to reason about. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions
[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14083: Assignee: Apache Spark > Analyze JVM bytecode and turn closures into Catalyst expressions > > > Key: SPARK-14083 > URL: https://issues.apache.org/jira/browse/SPARK-14083 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark >Priority: Major > > One big advantage of the Dataset API is the type safety, at the cost of > performance due to heavy reliance on user-defined closures/lambdas. These > closures are typically slower than expressions because we have more > flexibility to optimize expressions (known data types, no virtual function > calls, etc). In many cases, it's actually not going to be very difficult to > look into the byte code of these closures and figure out what they are trying > to do. If we can understand them, then we can turn them directly into > Catalyst expressions for more optimized execution. > Some examples are: > {code} > df.map(_.name) // equivalent to expression col("name") > ds.groupBy(_.gender) // equivalent to expression col("gender") > df.filter(_.age > 18) // equivalent to expression GreaterThan(col("age"), > lit(18)) > df.map(_.id + 1) // equivalent to Add(col("id"), lit(1)) > {code} > The goal of this ticket is to design a small framework for byte code analysis > and use that to convert closures/lambdas into Catalyst expressions in order > to speed up Dataset execution. It is a little bit futuristic, but I believe > it is very doable. The framework should be easy to reason about (e.g. similar > to Catalyst). > Note that there is a big emphasis on "small" and "easy to reason about". A patch > should be rejected if it is too complicated or difficult to reason about. 
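The ticket proposes JVM bytecode analysis; as a lighter illustration of the same idea — recovering an expression tree from a closure — here is a symbolic-tracing sketch in Python. The `Col`/`TraceRow` names and the tuple-based expression encoding are invented for this example; they are not Spark or Catalyst APIs:

```python
class Col:
    """Symbolic column reference: operators build expression nodes
    instead of computing values."""
    def __init__(self, name):
        self.name = name
    def __gt__(self, other):
        return ("GreaterThan", self, other)
    def __add__(self, other):
        return ("Add", self, other)
    def __repr__(self):
        return f'col("{self.name}")'

class TraceRow:
    """Stand-in row whose attribute access yields symbolic columns."""
    def __getattr__(self, name):
        return Col(name)

def closure_to_expr(f):
    # Evaluate the closure against a tracer row; the result is an
    # expression tree rather than a value. This only covers closures
    # composed of the operations Col knows how to trace.
    return f(TraceRow())
```

`closure_to_expr(lambda r: r.age > 18)` yields `("GreaterThan", col("age"), 18)`, the analogue of the `GreaterThan(col("age"), lit(18))` example above. Real bytecode analysis would handle arbitrary closures; tracing only reaches closures built from overloadable operations, which is why the ticket asks for bytecode inspection instead.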
[jira] [Updated] (SPARK-27619) MapType should be prohibited in hash expressions
[ https://issues.apache.org/jira/browse/SPARK-27619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-27619: --- Description: Spark currently allows MapType expressions to be used as input to hash expressions, but I think that this should be prohibited because Spark SQL does not support map equality. Currently, Spark SQL's map hashcodes are sensitive to the insertion order of map elements: {code:java} val a = spark.createDataset(Map(1->1, 2->2) :: Nil) val b = spark.createDataset(Map(2->2, 1->1) :: Nil) // Demonstration of how Scala Map equality is unaffected by insertion order: assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode()) assert(Map(1->1, 2->2) == Map(2->2, 1->1)) assert(a.first() == b.first()) // In contrast, this will print two different hashcodes: println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code} This behavior might be surprising to Scala developers. I think there's precedence for banning the use of MapType here because we already prohibit MapType in aggregation / joins / equality comparisons (SPARK-9415) and set operations (SPARK-19893). If we decide that we want this to be an error then it might also be a good idea to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the old and buggy behavior (in case applications were relying on it in cases where it just so happens to be safe-by-accident (e.g. maps which only have one entry)). Alternatively, we could support hashing here if we implemented support for comparable map types (SPARK-18134). was: Spark currently allows MapType expressions to be used as input to hash expressions, but I think that this should be prohibited because Spark SQL does not support map equality. 
Currently, Spark SQL's map hashcodes are sensitive to the insertion order of map elements: {code:java} val a = spark.createDataset(Map(1->1, 2->2) :: Nil) val b = spark.createDataset(Map(2->2, 1->1) :: Nil) # Demonstration of how Scala Map equality is unaffected by insertion order: assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode()) assert(Map(1->1, 2->2) == Map(2->2, 1->1)) assert(a.first() == b.first()) # In contrast, this will print two different hashcodes: println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code} This behavior might be surprising to Scala developers. I think there's precedence for banning the use of MapType here because we already prohibit MapType in aggregation / joins / equality comparisons (SPARK-9415) and set operations (SPARK-19893). If we decide that we want this to be an error then it might also be a good idea to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the old and buggy behavior (in case applications were relying on it in cases where it just so happens to be safe-by-accident (e.g. maps which only have one entry)). Alternatively, we could support hashing here if we implemented support for comparable map types (SPARK-18134). > MapType should be prohibited in hash expressions > > > Key: SPARK-27619 > URL: https://issues.apache.org/jira/browse/SPARK-27619 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Major > Labels: correctness > > Spark currently allows MapType expressions to be used as input to hash > expressions, but I think that this should be prohibited because Spark SQL > does not support map equality. 
> Currently, Spark SQL's map hashcodes are sensitive to the insertion order of > map elements: > {code:java} > val a = spark.createDataset(Map(1->1, 2->2) :: Nil) > val b = spark.createDataset(Map(2->2, 1->1) :: Nil) > // Demonstration of how Scala Map equality is unaffected by insertion order: > assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode()) > assert(Map(1->1, 2->2) == Map(2->2, 1->1)) > assert(a.first() == b.first()) > // In contrast, this will print two different hashcodes: > println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code} > This behavior might be surprising to Scala developers. > I think there's precedent for banning the use of MapType here because we > already prohibit MapType in aggregation / joins / equality comparisons > (SPARK-9415) and set operations (SPARK-19893). > If we decide that we want this to be an error then it might also be a good > idea to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the > old and buggy behavior (in case applications were relying on it in cases > where it just so happens to be safe-by-accident (e.g. maps which only have > one entry)). > Alternatively, we could support hashing here if we implemented support for > comparable map types (SPARK-18134). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SPARK-27619) MapType should be prohibited in hash expressions
[ https://issues.apache.org/jira/browse/SPARK-27619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-27619: --- Description: Spark currently allows MapType expressions to be used as input to hash expressions, but I think that this should be prohibited because Spark SQL does not support map equality. Currently, Spark SQL's map hashcodes are sensitive to the insertion order of map elements: {code:java} val a = spark.createDataset(Map(1->1, 2->2) :: Nil) val b = spark.createDataset(Map(2->2, 1->1) :: Nil) # Demonstration of how Scala Map equality is unaffected by insertion order: assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode()) assert(Map(1->1, 2->2) == Map(2->2, 1->1)) assert(a.first() == b.first()) # In contrast, this will print two different hashcodes: println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code} This behavior might be surprising to Scala developers. I think there's precedence for banning the use of MapType here because we already prohibit MapType in aggregation / joins / equality comparisons (SPARK-9415) and set operations (SPARK-19893). If we decide that we want this to be an error then it might also be a good idea to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the old and buggy behavior (in case applications were relying on it in cases where it just so happens to be safe-by-accident (e.g. maps which only have one entry)). Alternatively, we could support hashing here if we implemented support for comparable map types (SPARK-18134). was: Spark currently allows MapType expressions to be used as input to hash expressions, but I think that this should be prohibited because Spark SQL does not support map equality. 
Currently, Spark SQL's map hashcodes are sensitive to the insertion order of map elements: {code:java} val a = spark.createDataset(Map(1->1, 2->2) :: Nil) val b = spark.createDataset(Map(2->2, 1->1) :: Nil) # Demonstration of how Scala Map equality is unaffected by insertion order: assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode()) assert(Map(1->1, 2->2) == Map(2->2, 1->1)) assert(a.first() == b.first()) # In contrast, this will print two different hashcodes: println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code} This behavior might be surprising to Scala developers. I think there's precedence for banning the use of MapType here because we already prohibit MapType in aggregation / joins / equality comparisons (SPARK-9415) and set operations (SPARK-19893). Alternatively, we could support hashing here if we implemented support for comparable map types (SPARK-18134). > MapType should be prohibited in hash expressions > > > Key: SPARK-27619 > URL: https://issues.apache.org/jira/browse/SPARK-27619 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Major > Labels: correctness > > Spark currently allows MapType expressions to be used as input to hash > expressions, but I think that this should be prohibited because Spark SQL > does not support map equality. > Currently, Spark SQL's map hashcodes are sensitive to the insertion order of > map elements: > {code:java} > val a = spark.createDataset(Map(1->1, 2->2) :: Nil) > val b = spark.createDataset(Map(2->2, 1->1) :: Nil) > # Demonstration of how Scala Map equality is unaffected by insertion order: > assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode()) > assert(Map(1->1, 2->2) == Map(2->2, 1->1)) > assert(a.first() == b.first()) > # In contrast, this will print two different hashcodes: > println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code} > This behavior might be surprising to Scala developers. 
> I think there's precedent for banning the use of MapType here because we > already prohibit MapType in aggregation / joins / equality comparisons > (SPARK-9415) and set operations (SPARK-19893). > If we decide that we want this to be an error then it might also be a good > idea to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the > old and buggy behavior (in case applications were relying on it in cases > where it just so happens to be safe-by-accident (e.g. maps which only have > one entry)). > Alternatively, we could support hashing here if we implemented support for > comparable map types (SPARK-18134). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
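For reference, the kind of hash that would make map hashing order-safe is one whose per-entry hashes are combined with a commutative operation, so insertion order cannot influence the result. A sketch of that idea in Python, illustrative only and not Spark's hash implementation:

```python
def order_insensitive_hash(m):
    """Hash a dict as a function of its entry set only.

    Per-entry hashes are combined with addition (a commutative
    operation, folded into 64 bits), so the order in which entries
    were inserted cannot affect the result.
    """
    h = 0
    for k, v in m.items():
        h = (h + hash((k, v))) & 0xFFFFFFFFFFFFFFFF
    return h
```

Unlike the `hash(*)` behavior in the report, `order_insensitive_hash({1: 1, 2: 2})` equals `order_insensitive_hash({2: 2, 1: 1})`, matching how map equality itself ignores insertion order.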
[jira] [Commented] (SPARK-27623) Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
[ https://issues.apache.org/jira/browse/SPARK-27623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831684#comment-16831684 ] Alexandru Barbulescu commented on SPARK-27623: -- The problem might be related to the fact that in Spark 2.4.2, the pre-built convenience binaries are compiled for Scala 2.12. And the spark-cassandra-connector, that I also included, currently only supports 2.11. > Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated > --- > > Key: SPARK-27623 > URL: https://issues.apache.org/jira/browse/SPARK-27623 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.2 >Reporter: Alexandru Barbulescu >Priority: Critical > > After updating to spark 2.4.2 when using the > {code:java} > spark.read.format().options().load() > {code} > > chain of methods, regardless of what parameter is passed to "format" we get > the following error related to avro: > > {code:java} > - .options(**load_options) > - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line > 172, in load > - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line > 1257, in __call__ > - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in > deco > - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line > 328, in get_return_value > - py4j.protocol.Py4JJavaError: An error occurred while calling o69.load. 
> - : java.util.ServiceConfigurationError: > org.apache.spark.sql.sources.DataSourceRegister: Provider > org.apache.spark.sql.avro.AvroFileFormat could not be instantiated > - at java.util.ServiceLoader.fail(ServiceLoader.java:232) > - at java.util.ServiceLoader.access$100(ServiceLoader.java:185) > - at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384) > - at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) > - at java.util.ServiceLoader$1.next(ServiceLoader.java:480) > - at > scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44) > - at scala.collection.Iterator.foreach(Iterator.scala:941) > - at scala.collection.Iterator.foreach$(Iterator.scala:941) > - at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > - at scala.collection.IterableLike.foreach(IterableLike.scala:74) > - at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > - at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > - at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:250) > - at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:248) > - at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) > - at scala.collection.TraversableLike.filter(TraversableLike.scala:262) > - at scala.collection.TraversableLike.filter$(TraversableLike.scala:262) > - at scala.collection.AbstractTraversable.filter(Traversable.scala:108) > - at > org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630) > - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194) > - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167) > - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > - at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > - at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > - at 
java.lang.reflect.Method.invoke(Method.java:498) > - at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > - at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > - at py4j.Gateway.invoke(Gateway.java:282) > - at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > - at py4j.commands.CallCommand.execute(CallCommand.java:79) > - at py4j.GatewayConnection.run(GatewayConnection.java:238) > - at java.lang.Thread.run(Thread.java:748) > - Caused by: java.lang.NoClassDefFoundError: > org/apache/spark/sql/execution/datasources/FileFormat$class > - at org.apache.spark.sql.avro.AvroFileFormat.(AvroFileFormat.scala:44) > - at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > - at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > - at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > - at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > - at java.lang.Class.newInstance(Class.java:442) > - at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380) > - ... 29 more > - Caused by: java.lang.ClassNotFoundException: > org.apache.spark.sql.execution.datasources.FileFormat$class > - at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > -
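The Scala-version mismatch suggested above can be caught before submitting a job: Scala artifacts encode the binary version in the Maven artifactId suffix (e.g. `spark-cassandra-connector_2.11`), and that suffix must match the Scala version the Spark binaries were built for. A hypothetical helper to check this (not part of Spark or Maven):

```python
def scala_suffix(artifact_id):
    """Return the Scala binary-version suffix of a Maven artifactId
    (e.g. "2.11" from "spark-cassandra-connector_2.11"), or None if
    the artifact is not Scala-versioned."""
    head, sep, tail = artifact_id.rpartition("_")
    if sep and tail.replace(".", "").isdigit():
        return tail
    return None

def compatible(spark_scala_version, artifact_id):
    """True when the artifact either carries no Scala suffix or its
    suffix matches the Scala version Spark was compiled against."""
    suffix = scala_suffix(artifact_id)
    return suffix is None or suffix == spark_scala_version
```

A Spark 2.4.2 distribution built for Scala 2.12 would reject `spark-cassandra-connector_2.11` here, which is exactly the pairing that produced the `NoClassDefFoundError` in the stack trace above.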
[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
[ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831680#comment-16831680 ] Liang-Chi Hsieh commented on SPARK-27612: - yeah, it seems the issue happens when the Python object gets pickled... > Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays > of None > - > > Key: SPARK-27612 > URL: https://issues.apache.org/jira/browse/SPARK-27612 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Critical > Labels: correctness > > This seems to only affect Python 3. > When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there > ends up being rows that are filled with None. > > {code:java} > In [1]: from pyspark.sql.types import ArrayType, IntegerType > > In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, > ArrayType(IntegerType(), True)) > In [3]: df.distinct().collect() > > Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])] > {code} > > From this example, it is consistently at elements 97, 98: > {code} > In [5]: df.collect()[-5:] > > Out[5]: > [Row(value=[1, 2, 3, 4]), > Row(value=[1, 2, 3, 4]), > Row(value=[None, None, None, None]), > Row(value=[None, None, None, None]), > Row(value=[1, 2, 3, 4])] > {code} > This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27607) Improve performance of Row.toString()
[ https://issues.apache.org/jira/browse/SPARK-27607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27607. --- Resolution: Fixed Assignee: Marco Gaido Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24505 > Improve performance of Row.toString() > - > > Key: SPARK-27607 > URL: https://issues.apache.org/jira/browse/SPARK-27607 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Assignee: Marco Gaido >Priority: Minor > Fix For: 3.0.0 > > > I have a job which ends up calling {{org.apache.spark.sql.Row.toString}} on > every row in a massive dataset (the reasons for this are slightly odd and > it's a bit non-trivial to change the job to avoid this step). > {{Row.toString}} is implemented by first constructing a WrappedArray > containing the Row's values (by calling {{toSeq}}) and then turning that > array into a string with {{mkString}}. We might be able to get a small > performance win by pipelining these steps, using an imperative loop to append > fields to a StringBuilder as soon as they're retrieved (thereby cutting out a > few layers of Scala collections indirection). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
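The proposed change — one pass appending into a single buffer as each field is retrieved, instead of materializing a WrappedArray and calling mkString — looks roughly like the following. This is a Python sketch of the pattern, not the actual Scala patch:

```python
import io

def row_to_string(values):
    """Render "[v1,v2,...]" in a single pass over the fields, appending
    each value to one buffer as it is retrieved (no intermediate
    sequence, mirroring the StringBuilder loop described above)."""
    buf = io.StringIO()
    buf.write("[")
    first = True
    for v in values:
        if not first:
            buf.write(",")
        buf.write(str(v))
        first = False
    buf.write("]")
    return buf.getvalue()
```

The win comes from cutting out the collection-building layers, which matters when, as in the reporter's job, the method runs once per row over a massive dataset.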
[jira] [Created] (SPARK-27623) Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
Alexandru Barbulescu created SPARK-27623: Summary: Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated Key: SPARK-27623 URL: https://issues.apache.org/jira/browse/SPARK-27623 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.2 Reporter: Alexandru Barbulescu After updating to spark 2.4.2 when using the {code:java} spark.read.format().options().load() {code} chain of methods, regardless of what parameter is passed to "format" we get the following error related to avro: {code:java} - .options(**load_options) - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 172, in load - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value - py4j.protocol.Py4JJavaError: An error occurred while calling o69.load. - : java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated - at java.util.ServiceLoader.fail(ServiceLoader.java:232) - at java.util.ServiceLoader.access$100(ServiceLoader.java:185) - at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384) - at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) - at java.util.ServiceLoader$1.next(ServiceLoader.java:480) - at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44) - at scala.collection.Iterator.foreach(Iterator.scala:941) - at scala.collection.Iterator.foreach$(Iterator.scala:941) - at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) - at scala.collection.IterableLike.foreach(IterableLike.scala:74) - at scala.collection.IterableLike.foreach$(IterableLike.scala:73) - at scala.collection.AbstractIterable.foreach(Iterable.scala:56) - at 
scala.collection.TraversableLike.filterImpl(TraversableLike.scala:250) - at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:248) - at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) - at scala.collection.TraversableLike.filter(TraversableLike.scala:262) - at scala.collection.TraversableLike.filter$(TraversableLike.scala:262) - at scala.collection.AbstractTraversable.filter(Traversable.scala:108) - at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630) - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194) - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167) - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) - at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) - at java.lang.reflect.Method.invoke(Method.java:498) - at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) - at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) - at py4j.Gateway.invoke(Gateway.java:282) - at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) - at py4j.commands.CallCommand.execute(CallCommand.java:79) - at py4j.GatewayConnection.run(GatewayConnection.java:238) - at java.lang.Thread.run(Thread.java:748) - Caused by: java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/FileFormat$class - at org.apache.spark.sql.avro.AvroFileFormat.(AvroFileFormat.scala:44) - at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) - at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) - at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) - at java.lang.reflect.Constructor.newInstance(Constructor.java:423) - at java.lang.Class.newInstance(Class.java:442) - at 
java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380) - ... 29 more - Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.execution.datasources.FileFormat$class - at java.net.URLClassLoader.findClass(URLClassLoader.java:382) - at java.lang.ClassLoader.loadClass(ClassLoader.java:424) - at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) - at java.lang.ClassLoader.loadClass(ClassLoader.java:357) - ... 36 more {code} The code we run looks like this: {code:java} spark_session = ( SparkSession.builder .appName(APPLICATION_NAME) .master(MASTER_URL) .config('spark.cassandra.connection.host', SERVER_IP_ADDRESS) .config('spark.cassandra.auth.username', CASSANDRA_USERNAME) .config('spark.cassandra.auth.password', CASSANDRA_PASSWORD) .config('spark.sql.shuffle.partitions', 16) .config('parquet.enable.summary-metadata', 'true')
[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
[ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831623#comment-16831623 ] Hyukjin Kwon commented on SPARK-27612: -- Argh, this happens after we upgraded cloudpickle to 0.6.2 https://github.com/apache/spark/commit/75ea89ad94ca76646e4697cf98c78d14c6e2695f#diff-19fd865e0dd0d7e6b04b3b1e047dcda7 Upgrading cloudpickle to 0.8.1 still doesn't solve the problem. I think we should fix it in cloudpickle, make a cloudpickle release, and port that change into Spark. > Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays > of None > - > > Key: SPARK-27612 > URL: https://issues.apache.org/jira/browse/SPARK-27612 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Critical > Labels: correctness > > This seems to only affect Python 3. > When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there > ends up being rows that are filled with None. > > {code:java} > In [1]: from pyspark.sql.types import ArrayType, IntegerType > > In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, > ArrayType(IntegerType(), True)) > In [3]: df.distinct().collect() > > Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])] > {code} > > From this example, it is consistently at elements 97, 98: > {code} > In [5]: df.collect()[-5:] > > Out[5]: > [Row(value=[1, 2, 3, 4]), > Row(value=[1, 2, 3, 4]), > Row(value=[None, None, None, None]), > Row(value=[None, None, None, None]), > Row(value=[1, 2, 3, 4])] > {code} > This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
[ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831602#comment-16831602 ] Hyukjin Kwon commented on SPARK-27612: -- Argh, seems to be a regression. {code} >>> from pyspark.sql.types import ArrayType, IntegerType >>> df = spark.createDataFrame([[1, 2, 3, 4]] * 100, ArrayType(IntegerType(), >>> True)) >>> df.distinct().collect() [Row(value=[1, 2, 3, 4])] {code} Doesn't happen in Spark 2.4.1 and Spark 2.3.3 > Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays > of None > - > > Key: SPARK-27612 > URL: https://issues.apache.org/jira/browse/SPARK-27612 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > This seems to only affect Python 3. > When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there > ends up being rows that are filled with None. > > {code:java} > In [1]: from pyspark.sql.types import ArrayType, IntegerType > > In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, > ArrayType(IntegerType(), True)) > In [3]: df.distinct().collect() > > Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])] > {code} > > From this example, it is consistently at elements 97, 98: > {code} > In [5]: df.collect()[-5:] > > Out[5]: > [Row(value=[1, 2, 3, 4]), > Row(value=[1, 2, 3, 4]), > Row(value=[None, None, None, None]), > Row(value=[None, None, None, None]), > Row(value=[1, 2, 3, 4])] > {code} > This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
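The comments above trace this regression to serialization. The invariant any correct pickler must preserve is round-trip equality — exactly what the reported `[None, None, None, None]` rows violate inside PySpark. A sketch of that invariant using the stdlib `pickle` module (plain `pickle`, not Spark's cloudpickle path):

```python
import pickle

def roundtrip(obj):
    """Serialize then deserialize; a correct pickler returns an equal object."""
    return pickle.loads(pickle.dumps(obj))

# The rows from the report, as plain nested lists: 100 copies of [1, 2, 3, 4].
batch = [[1, 2, 3, 4]] * 100
```

With stdlib pickle every row survives the round trip unchanged; the bug report shows the PySpark path instead turning some rows (consistently elements 97 and 98) into arrays of None.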
[jira] [Commented] (SPARK-23098) Migrate Kafka batch source to v2
[ https://issues.apache.org/jira/browse/SPARK-23098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831601#comment-16831601 ] Gabor Somogyi commented on SPARK-23098: --- [~DylanGuedes] are you working on this? > Migrate Kafka batch source to v2 > > > Key: SPARK-23098 > URL: https://issues.apache.org/jira/browse/SPARK-23098 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
[ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27612: - Labels: correctness (was: ) > Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays > of None > - > > Key: SPARK-27612 > URL: https://issues.apache.org/jira/browse/SPARK-27612 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Critical > Labels: correctness > > This seems to only affect Python 3. > When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there > ends up being rows that are filled with None. > > {code:java} > In [1]: from pyspark.sql.types import ArrayType, IntegerType > > In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, > ArrayType(IntegerType(), True)) > In [3]: df.distinct().collect() > > Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])] > {code} > > From this example, it is consistently at elements 97, 98: > {code} > In [5]: df.collect()[-5:] > > Out[5]: > [Row(value=[1, 2, 3, 4]), > Row(value=[1, 2, 3, 4]), > Row(value=[None, None, None, None]), > Row(value=[None, None, None, None]), > Row(value=[1, 2, 3, 4])] > {code} > This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
[ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27612: - Priority: Critical (was: Major) > Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays > of None > - > > Key: SPARK-27612 > URL: https://issues.apache.org/jira/browse/SPARK-27612 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Critical > > This seems to only affect Python 3. > When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there > ends up being rows that are filled with None. > > {code:java} > In [1]: from pyspark.sql.types import ArrayType, IntegerType > > In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, > ArrayType(IntegerType(), True)) > In [3]: df.distinct().collect() > > Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])] > {code} > > From this example, it is consistently at elements 97, 98: > {code} > In [5]: df.collect()[-5:] > > Out[5]: > [Row(value=[1, 2, 3, 4]), > Row(value=[1, 2, 3, 4]), > Row(value=[None, None, None, None]), > Row(value=[None, None, None, None]), > Row(value=[1, 2, 3, 4])] > {code} > This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
[ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831600#comment-16831600 ] Liang-Chi Hsieh commented on SPARK-27612: - Yup, I can reproduce it too. No worry [~bryanc]. :) Will take some time to look into it. > Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays > of None > - > > Key: SPARK-27612 > URL: https://issues.apache.org/jira/browse/SPARK-27612 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > This seems to only affect Python 3. > When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there > ends up being rows that are filled with None. > > {code:java} > In [1]: from pyspark.sql.types import ArrayType, IntegerType > > In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, > ArrayType(IntegerType(), True)) > In [3]: df.distinct().collect() > > Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])] > {code} > > From this example, it is consistently at elements 97, 98: > {code} > In [5]: df.collect()[-5:] > > Out[5]: > [Row(value=[1, 2, 3, 4]), > Row(value=[1, 2, 3, 4]), > Row(value=[None, None, None, None]), > Row(value=[None, None, None, None]), > Row(value=[1, 2, 3, 4])] > {code} > This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
[ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831596#comment-16831596 ] Hyukjin Kwon commented on SPARK-27612: -- haha, you're not crazy {code} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT /_/ Using Python version 3.7.3 (default, Mar 27 2019 09:23:15) SparkSession available as 'spark'. >>> from pyspark.sql.types import ArrayType, IntegerType >>> df = spark.createDataFrame([[1, 2, 3, 4]] * 100, ArrayType(IntegerType(), >>> True)) >>> df.distinct().collect() [Row(value=[None, None]), Row(value=[1, 2, 3, 4])] {code} > Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays > of None > - > > Key: SPARK-27612 > URL: https://issues.apache.org/jira/browse/SPARK-27612 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > This seems to only affect Python 3. > When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there > ends up being rows that are filled with None. > > {code:java} > In [1]: from pyspark.sql.types import ArrayType, IntegerType > > In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, > ArrayType(IntegerType(), True)) > In [3]: df.distinct().collect() > > Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])] > {code} > > From this example, it is consistently at elements 97, 98: > {code} > In [5]: df.collect()[-5:] > > Out[5]: > [Row(value=[1, 2, 3, 4]), > Row(value=[1, 2, 3, 4]), > Row(value=[None, None, None, None]), > Row(value=[None, None, None, None]), > Row(value=[1, 2, 3, 4])] > {code} > This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26759) Arrow optimization in SparkR's interoperability
[ https://issues.apache.org/jira/browse/SPARK-26759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26759. -- Resolution: Done > Arrow optimization in SparkR's interoperability > --- > > Key: SPARK-26759 > URL: https://issues.apache.org/jira/browse/SPARK-26759 > Project: Spark > Issue Type: Umbrella > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Arrow 0.12.0 is released and it contains an R API. We could optimize Spark > DataFrame <> R DataFrame interoperability. > For instance, see the examples below: > - {{dapply}} > {code:java} > df <- createDataFrame(mtcars) > collect(dapply(df, >function(r.data.frame) { > data.frame(r.data.frame$gear) >}, >structType("gear long"))) > {code} > - {{gapply}} > {code:java} > df <- createDataFrame(mtcars) > collect(gapply(df, >"gear", >function(key, group) { > data.frame(gear = key[[1]], disp = mean(group$disp) > > group$disp) >}, >structType("gear double, disp boolean"))) > {code} > - R DataFrame -> Spark DataFrame > {code:java} > createDataFrame(mtcars) > {code} > - Spark DataFrame -> R DataFrame > {code:java} > collect(df) > head(df) > {code} > Currently, some of the communication paths between the R side and the JVM side have to > buffer the data and flush it at once due to ARROW-4512. I don't plan to fix > that under this umbrella. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26759) Arrow optimization in SparkR's interoperability
[ https://issues.apache.org/jira/browse/SPARK-26759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26759: - Fix Version/s: 3.0.0 > Arrow optimization in SparkR's interoperability > --- > > Key: SPARK-26759 > URL: https://issues.apache.org/jira/browse/SPARK-26759 > Project: Spark > Issue Type: Umbrella > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Arrow 0.12.0 is released and it contains an R API. We could optimize Spark > DataFrame <> R DataFrame interoperability. > For instance, see the examples below: > - {{dapply}} > {code:java} > df <- createDataFrame(mtcars) > collect(dapply(df, >function(r.data.frame) { > data.frame(r.data.frame$gear) >}, >structType("gear long"))) > {code} > - {{gapply}} > {code:java} > df <- createDataFrame(mtcars) > collect(gapply(df, >"gear", >function(key, group) { > data.frame(gear = key[[1]], disp = mean(group$disp) > > group$disp) >}, >structType("gear double, disp boolean"))) > {code} > - R DataFrame -> Spark DataFrame > {code:java} > createDataFrame(mtcars) > {code} > - Spark DataFrame -> R DataFrame > {code:java} > collect(df) > head(df) > {code} > Currently, some of the communication paths between the R side and the JVM side have to > buffer the data and flush it at once due to ARROW-4512. I don't plan to fix > that under this umbrella. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26924) Fix CRAN hack as soon as Arrow is available on CRAN
[ https://issues.apache.org/jira/browse/SPARK-26924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831594#comment-16831594 ] Hyukjin Kwon commented on SPARK-26924: -- I am going to make this a standalone issue. I don't think we can fix it within 3.0.0. > Fix CRAN hack as soon as Arrow is available on CRAN > --- > > Key: SPARK-26924 > URL: https://issues.apache.org/jira/browse/SPARK-26924 > Project: Spark > Issue Type: Sub-task > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Arrow optimization was added but Arrow is not available on CRAN. > So, it had to add some hacks to avoid CRAN check in SparkR side. For example, > see > https://github.com/apache/spark/search?q=requireNamespace1_q=requireNamespace1 > These should be removed to properly check CRAN in SparkR > See also ARROW-3204 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26924) Fix CRAN hack as soon as Arrow is available on CRAN
[ https://issues.apache.org/jira/browse/SPARK-26924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26924: - Issue Type: Bug (was: Sub-task) Parent: (was: SPARK-26759) > Fix CRAN hack as soon as Arrow is available on CRAN > --- > > Key: SPARK-26924 > URL: https://issues.apache.org/jira/browse/SPARK-26924 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Arrow optimization was added but Arrow is not available on CRAN. > So, it had to add some hacks to avoid CRAN check in SparkR side. For example, > see > https://github.com/apache/spark/search?q=requireNamespace1_q=requireNamespace1 > These should be removed to properly check CRAN in SparkR > See also ARROW-3204 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26921) Document Arrow optimization and vectorized R APIs
[ https://issues.apache.org/jira/browse/SPARK-26921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26921: - Summary: Document Arrow optimization and vectorized R APIs (was: Fix CRAN hack as soon as Arrow is available on CRAN) > Document Arrow optimization and vectorized R APIs > - > > Key: SPARK-26921 > URL: https://issues.apache.org/jira/browse/SPARK-26921 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > We should update SparkR guide documentation, and some related documents, > comments like in {{SQLConf.scala}} when most of the tasks are finished -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26924) Fix CRAN hack as soon as Arrow is available on CRAN
[ https://issues.apache.org/jira/browse/SPARK-26924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26924: - Summary: Fix CRAN hack as soon as Arrow is available on CRAN (was: Document Arrow optimization and vectorized R APIs) > Fix CRAN hack as soon as Arrow is available on CRAN > --- > > Key: SPARK-26924 > URL: https://issues.apache.org/jira/browse/SPARK-26924 > Project: Spark > Issue Type: Sub-task > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Arrow optimization was added but Arrow is not available on CRAN. > So, it had to add some hacks to avoid CRAN check in SparkR side. For example, > see > https://github.com/apache/spark/search?q=requireNamespace1_q=requireNamespace1 > These should be removed to properly check CRAN in SparkR > See also ARROW-3204 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26921) Fix CRAN hack as soon as Arrow is available on CRAN
[ https://issues.apache.org/jira/browse/SPARK-26921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26921: - Description: We should update SparkR guide documentation, and some related documents, comments like in {{SQLConf.scala}} when most of the tasks are finished was: Arrow optimization was added but Arrow is not available on CRAN. So, it had to add some hacks to avoid CRAN check in SparkR side. For example, see https://github.com/apache/spark/search?q=requireNamespace1_q=requireNamespace1 These should be removed to properly check CRAN in SparkR See also ARROW-3204 > Fix CRAN hack as soon as Arrow is available on CRAN > --- > > Key: SPARK-26921 > URL: https://issues.apache.org/jira/browse/SPARK-26921 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > We should update SparkR guide documentation, and some related documents, > comments like in {{SQLConf.scala}} when most of the tasks are finished -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26924) Document Arrow optimization and vectorized R APIs
[ https://issues.apache.org/jira/browse/SPARK-26924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26924: - Description: Arrow optimization was added but Arrow is not available on CRAN. So, it had to add some hacks to avoid CRAN check in SparkR side. For example, see https://github.com/apache/spark/search?q=requireNamespace1_q=requireNamespace1 These should be removed to properly check CRAN in SparkR See also ARROW-3204 was:We should update SparkR guide documentation, and some related documents, comments like in {{SQLConf.scala}} when most of tasks are finished. > Document Arrow optimization and vectorized R APIs > - > > Key: SPARK-26924 > URL: https://issues.apache.org/jira/browse/SPARK-26924 > Project: Spark > Issue Type: Sub-task > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Arrow optimization was added but Arrow is not available on CRAN. > So, it had to add some hacks to avoid CRAN check in SparkR side. For example, > see > https://github.com/apache/spark/search?q=requireNamespace1_q=requireNamespace1 > These should be removed to properly check CRAN in SparkR > See also ARROW-3204 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27606) Deprecate `extended` field in ExpressionDescription/ExpressionInfo
[ https://issues.apache.org/jira/browse/SPARK-27606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27606. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24500 [https://github.com/apache/spark/pull/24500] > Deprecate `extended` field in ExpressionDescription/ExpressionInfo > -- > > Key: SPARK-27606 > URL: https://issues.apache.org/jira/browse/SPARK-27606 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > As of SPARK-21485 and SPARK-27328, we have a nicer way to separately describe > extended usages. > The `extended` field and method at ExpressionDescription/ExpressionInfo are now > pretty useless. > This Jira aims to deprecate them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27606) Deprecate `extended` field in ExpressionDescription/ExpressionInfo
[ https://issues.apache.org/jira/browse/SPARK-27606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-27606: Assignee: Hyukjin Kwon > Deprecate `extended` field in ExpressionDescription/ExpressionInfo > -- > > Key: SPARK-27606 > URL: https://issues.apache.org/jira/browse/SPARK-27606 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > As of SPARK-21485 and SPARK-27328, we have a nicer way to separately describe > extended usages. > The `extended` field and method at ExpressionDescription/ExpressionInfo are now > pretty useless. > This Jira aims to deprecate them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27622) Avoiding network communication when block managers are running on the same host
[ https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831553#comment-16831553 ] Attila Zsolt Piros commented on SPARK-27622: I am already working on this. A working prototype for RDD blocks is ready and working. > Avoiding network communication when block managers are running on the same host > -- > > Key: SPARK-27622 > URL: https://issues.apache.org/jira/browse/SPARK-27622 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Currently fetching blocks always uses the network even when the two block > managers are running on the same host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27622) Avoiding network communication when block managers are running on the same host
Attila Zsolt Piros created SPARK-27622: -- Summary: Avoiding network communication when block managers are running on the same host Key: SPARK-27622 URL: https://issues.apache.org/jira/browse/SPARK-27622 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 3.0.0 Reporter: Attila Zsolt Piros Currently fetching blocks always uses the network even when the two block managers are running on the same host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-27621) Calling transform() method on a LinearRegressionModel throws NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-27621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anca Sarb updated SPARK-27621: -- Comment: was deleted (was: I've created a PR with the fix here [https://github.com/apache/spark/pull/24509]) > Calling transform() method on a LinearRegressionModel throws > NoSuchElementException > --- > > Key: SPARK-27621 > URL: https://issues.apache.org/jira/browse/SPARK-27621 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2 >Reporter: Anca Sarb >Priority: Minor > Original Estimate: 2h > Remaining Estimate: 2h > > When transform(...) method is called on a LinearRegressionModel created > directly with the coefficients and intercepts, the following exception is > encountered. > {code:java} > java.util.NoSuchElementException: Failed to find a default value for loss at > org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780) > at > org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780) > at scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779) at > org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42) at > org.apache.spark.ml.param.Params$class.$(params.scala:786) at > org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42) at > org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111) > at > org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637) > at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192) > at > org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311) > at > org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) at > org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311) at > org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at > org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305) > {code} > This is because validateAndTransformSchema() is called both during training > and scoring phases, but the checks against the training related params like > loss should really be performed during training phase only, I think, please > correct me if I'm missing anything. > This issue was first reported for mleap > ([combust/mleap#455|https://github.com/combust/mleap/issues/455]) because > basically when we serialize the Spark transformers for mleap, we only > serialize the params that are relevant for scoring. We do have the option to > de-serialize the serialized transformers back into Spark for scoring again, > but in that case, we no longer have all the training params. > Test to reproduce in PR: [https://github.com/apache/spark/pull/24509] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27621) Calling transform() method on a LinearRegressionModel throws NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-27621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831502#comment-16831502 ] Apache Spark commented on SPARK-27621: -- User 'ancasarb' has created a pull request for this issue: https://github.com/apache/spark/pull/24509 > Calling transform() method on a LinearRegressionModel throws > NoSuchElementException > --- > > Key: SPARK-27621 > URL: https://issues.apache.org/jira/browse/SPARK-27621 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2 >Reporter: Anca Sarb >Priority: Minor > Original Estimate: 2h > Remaining Estimate: 2h > > When transform(...) method is called on a LinearRegressionModel created > directly with the coefficients and intercepts, the following exception is > encountered. > {code:java} > java.util.NoSuchElementException: Failed to find a default value for loss at > org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780) > at > org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780) > at scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779) at > org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42) at > org.apache.spark.ml.param.Params$class.$(params.scala:786) at > org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42) at > org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111) > at > org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637) > at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192) > at > org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311) > at > org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311) > at > 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) at > org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311) at > org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at > org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305) > {code} > This is because validateAndTransformSchema() is called both during training > and scoring phases, but the checks against the training related params like > loss should really be performed during training phase only, I think, please > correct me if I'm missing anything. > This issue was first reported for mleap > ([combust/mleap#455|https://github.com/combust/mleap/issues/455]) because > basically when we serialize the Spark transformers for mleap, we only > serialize the params that are relevant for scoring. We do have the option to > de-serialize the serialized transformers back into Spark for scoring again, > but in that case, we no longer have all the training params. > Test to reproduce in PR: [https://github.com/apache/spark/pull/24509] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27621) Calling transform() method on a LinearRegressionModel throws NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-27621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831501#comment-16831501 ] Apache Spark commented on SPARK-27621: -- User 'ancasarb' has created a pull request for this issue: https://github.com/apache/spark/pull/24509 > Calling transform() method on a LinearRegressionModel throws > NoSuchElementException > --- > > Key: SPARK-27621 > URL: https://issues.apache.org/jira/browse/SPARK-27621 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2 >Reporter: Anca Sarb >Priority: Minor > Original Estimate: 2h > Remaining Estimate: 2h > > When transform(...) method is called on a LinearRegressionModel created > directly with the coefficients and intercepts, the following exception is > encountered. > {code:java} > java.util.NoSuchElementException: Failed to find a default value for loss at > org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780) > at > org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780) > at scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779) at > org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42) at > org.apache.spark.ml.param.Params$class.$(params.scala:786) at > org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42) at > org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111) > at > org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637) > at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192) > at > org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311) > at > org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311) > at > 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) at > org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311) at > org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at > org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305) > {code} > This is because validateAndTransformSchema() is called both during training > and scoring phases, but the checks against the training related params like > loss should really be performed during training phase only, I think, please > correct me if I'm missing anything. > This issue was first reported for mleap > ([combust/mleap#455|https://github.com/combust/mleap/issues/455]) because > basically when we serialize the Spark transformers for mleap, we only > serialize the params that are relevant for scoring. We do have the option to > de-serialize the serialized transformers back into Spark for scoring again, > but in that case, we no longer have all the training params. > Test to reproduce in PR: [https://github.com/apache/spark/pull/24509] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27621) Calling transform() method on a LinearRegressionModel throws NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-27621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27621: Assignee: Apache Spark
[jira] [Assigned] (SPARK-27621) Calling transform() method on a LinearRegressionModel throws NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-27621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27621: Assignee: (was: Apache Spark)
[jira] [Updated] (SPARK-27621) Calling transform() method on a LinearRegressionModel throws NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-27621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anca Sarb updated SPARK-27621: -- Description: When transform(...) method is called on a LinearRegressionModel created directly with the coefficients and intercepts, the following exception is encountered. {code:java} java.util.NoSuchElementException: Failed to find a default value for loss at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780) at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779) at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42) at org.apache.spark.ml.param.Params$class.$(params.scala:786) at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42) at org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111) at org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637) at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192) at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311) at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) at org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311) at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305) {code} This is because validateAndTransformSchema() is called both during training and scoring phases, but the checks against the training related params like loss should really 
be performed during training phase only, I think, please correct me if I'm missing anything. This issue was first reported for mleap ([combust/mleap#455|https://github.com/combust/mleap/issues/455]) because basically when we serialize the Spark transformers for mleap, we only serialize the params that are relevant for scoring. We do have the option to de-serialize the serialized transformers back into Spark for scoring again, but in that case, we no longer have all the training params. Test to reproduce in PR: [https://github.com/apache/spark/pull/24509] was: When transform(...) method is called on a LinearRegressionModel created directly with the coefficients and intercepts, the following exception is encountered. {code:java} java.util.NoSuchElementException: Failed to find a default value for loss at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780) at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779) at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42) at org.apache.spark.ml.param.Params$class.$(params.scala:786) at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42) at org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111) at org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637) at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192) at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311) at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) at 
scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) at org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311) at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305) {code} This is because validateAndTransformSchema() is called both during training and scoring phases, but the checks against the training related params like loss should really be performed during training phase only, I think, please correct me if I'm missing anything :) This issue was first reported for mleap ([combust/mleap#455|https://github.com/combust/mleap/issues/455]) because basically when we serialize the Spark transformers for mleap, we only serialize the params that are relevant for scoring. We do have the option to de-serialize the serialized transformers back into Spark for scoring again, but in that case, we no longer have all the training params. Test to reproduce in PR: [https://github.com/apache/spark/pull/24509]
[jira] [Commented] (SPARK-27621) Calling transform() method on a LinearRegressionModel throws NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-27621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831499#comment-16831499 ] Anca Sarb commented on SPARK-27621: --- I've created a PR with the fix here [https://github.com/apache/spark/pull/24509]
[jira] [Created] (SPARK-27621) Calling transform() method on a LinearRegressionModel throws NoSuchElementException
Anca Sarb created SPARK-27621: - Summary: Calling transform() method on a LinearRegressionModel throws NoSuchElementException Key: SPARK-27621 URL: https://issues.apache.org/jira/browse/SPARK-27621 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.4.2, 2.4.1, 2.4.0, 2.3.3, 2.3.2, 2.3.1, 2.3.0, 2.3.4 Reporter: Anca Sarb When transform(...) method is called on a LinearRegressionModel created directly with the coefficients and intercepts, the following exception is encountered. {code:java} java.util.NoSuchElementException: Failed to find a default value for loss at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780) at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779) at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42) at org.apache.spark.ml.param.Params$class.$(params.scala:786) at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42) at org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111) at org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637) at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192) at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311) at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) at org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311) at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305) {code} 
This is because validateAndTransformSchema() is called both during the training and scoring phases, but the checks against training-related params like loss should really be performed during the training phase only, I think, please correct me if I'm missing anything :) This issue was first reported for mleap ([combust/mleap#455|https://github.com/combust/mleap/issues/455]) because when we serialize the Spark transformers for mleap, we only serialize the params that are relevant for scoring. We do have the option to de-serialize the serialized transformers back into Spark for scoring again, but in that case, we no longer have all the training params. Test to reproduce in PR: [https://github.com/apache/spark/pull/24509]
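The direction of the fix described above can be sketched without Spark: run checks that need training-only params solely in the fitting phase, and let the scoring path validate only scoring-relevant params. This is a hypothetical Python model, not Spark's API; `validate_and_transform_schema`, `get_or_default`, and the dictionaries here are simplified stand-ins ("squaredError" and "huber" are the loss values Spark documents for linear regression).

```python
class MissingParam(Exception):
    """Stand-in for the NoSuchElementException raised on an unset param."""
    pass


def get_or_default(params, name):
    # Raises when a param has neither a set value nor a default,
    # mimicking Params.getOrDefault.
    if name in params:
        return params[name]
    raise MissingParam(f"Failed to find a default value for {name}")


def validate_and_transform_schema(params, schema, fitting):
    # Training-only checks run only while fitting, never while scoring.
    if fitting:
        loss = get_or_default(params, "loss")  # raises if never set
        if loss not in ("squaredError", "huber"):
            raise ValueError(f"unknown loss: {loss}")
    # The scoring path touches only scoring-relevant params.
    return schema + ["prediction"]


# A model deserialized for scoring only (e.g. via mleap) carries no
# training params, yet scoring-time validation now succeeds:
scoring_params = {"featuresCol": "features"}
print(validate_and_transform_schema(scoring_params, ["features"], fitting=False))
# ['features', 'prediction']
```

With the pre-fix behavior, the same call with training checks enabled would raise `MissingParam` for `loss`, which is the failure the stack traces above report.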