[jira] [Commented] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode
[ https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16480199#comment-16480199 ]

Dongjoon Hyun commented on SPARK-21945:
---------------------------------------

Thank you for verifying and including this in 2.3.1 RC2, [~vanzin].

> pyspark --py-files doesn't work in yarn client mode
> ---------------------------------------------------
>
>                 Key: SPARK-21945
>                 URL: https://issues.apache.org/jira/browse/SPARK-21945
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Thomas Graves
>            Assignee: Hyukjin Kwon
>            Priority: Major
>             Fix For: 2.3.1, 2.4.0
>
>
> I tried running pyspark with --py-files pythonfiles.zip but it doesn't
> properly add the zip file to the PYTHONPATH.
> I can work around by exporting PYTHONPATH.
> Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand I don't see
> this supported at all. If that is the case perhaps it should be moved to
> improvement.
> Note it works via spark-submit in both client and cluster mode to run python
> script.


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16317) Add file filtering interface for FileFormat
[ https://issues.apache.org/jira/browse/SPARK-16317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16480127#comment-16480127 ]

Xiao Li commented on SPARK-16317:
---------------------------------

We will not improve FileFormat since we are migrating the implementation to
the data source v2 API.

> Add file filtering interface for FileFormat
> -------------------------------------------
>
>                 Key: SPARK-16317
>                 URL: https://issues.apache.org/jira/browse/SPARK-16317
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Priority: Minor
>
> {{FileFormat}} data sources like Parquet and Avro (provided by spark-avro)
> have customized file filtering logic. For example, Parquet needs to filter
> out summary files, while Avro provides a Hadoop configuration option to
> filter out all files whose names don't end with ".avro".
> It would be nice to have a general file filtering interface in
> {{FileFormat}} to handle similar requirements.
[jira] [Resolved] (SPARK-16317) Add file filtering interface for FileFormat
[ https://issues.apache.org/jira/browse/SPARK-16317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-16317.
-----------------------------
    Resolution: Won't Fix

> Add file filtering interface for FileFormat
> -------------------------------------------
>
>                 Key: SPARK-16317
>                 URL: https://issues.apache.org/jira/browse/SPARK-16317
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Priority: Minor
>
> {{FileFormat}} data sources like Parquet and Avro (provided by spark-avro)
> have customized file filtering logic. For example, Parquet needs to filter
> out summary files, while Avro provides a Hadoop configuration option to
> filter out all files whose names don't end with ".avro".
> It would be nice to have a general file filtering interface in
> {{FileFormat}} to handle similar requirements.
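The filtering behavior SPARK-16317 describes (Parquet dropping summary files such as "_metadata", Avro keeping only names ending in ".avro") amounts to a simple predicate over file paths. A minimal Python sketch of that idea, with hypothetical names; this is not Spark's FileFormat API:

```python
def filter_data_files(paths, extension=None, exclude_names=()):
    """Keep only paths that look like data files.

    extension: if given, keep only names ending with it (Avro-style).
    exclude_names: basenames to drop (Parquet-style summary files,
    e.g. "_metadata" and "_common_metadata").
    """
    kept = []
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        if name in exclude_names:
            continue  # Parquet-style: skip summary files
        if extension is not None and not name.endswith(extension):
            continue  # Avro-style: skip non-matching extensions
        kept.append(path)
    return kept

# Parquet-style filtering: drop summary files, keep data parts
parquet = filter_data_files(
    ["/t/part-0.parquet", "/t/_metadata", "/t/_common_metadata"],
    exclude_names=("_metadata", "_common_metadata"))

# Avro-style filtering: keep only ".avro" files
avro = filter_data_files(["/t/a.avro", "/t/b.txt"], extension=".avro")
```

A general interface on `FileFormat` would let each source plug in its own predicate instead of hard-coding both behaviors.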
[jira] [Updated] (SPARK-20758) Add Constant propagation optimization
[ https://issues.apache.org/jira/browse/SPARK-20758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jinhua Fu updated SPARK-20758:
------------------------------
    Issue Type: New JIRA Project  (was: Improvement)

> Add Constant propagation optimization
> -------------------------------------
>
>                 Key: SPARK-20758
>                 URL: https://issues.apache.org/jira/browse/SPARK-20758
>             Project: Spark
>          Issue Type: New JIRA Project
>          Components: SQL
>    Affects Versions: 2.1.1
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>            Priority: Minor
>             Fix For: 2.3.0
>
> Constant propagation involves substituting attributes which can be
> statically evaluated in expressions. It's a pretty common optimization in
> the compiler world.
> e.g.
> {noformat}
> SELECT * FROM table WHERE i = 5 AND j = i + 3
> {noformat}
> can be re-written as:
> {noformat}
> SELECT * FROM table WHERE i = 5 AND j = 8
> {noformat}
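The rewrite in SPARK-20758's example (`i = 5 AND j = i + 3` becoming `i = 5 AND j = 8`) can be sketched in a few lines. This is a toy fixed-point pass over a conjunction of equality predicates, not Spark's Catalyst rule; the tuple encoding and helper names are invented for illustration:

```python
def fold(expr, env):
    """Substitute known constants and fold arithmetic.

    expr is an int constant, a variable name (str), or ("add", a, b).
    """
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return env.get(expr, expr)  # substitute if the variable is known
    op, a, b = expr
    a, b = fold(a, env), fold(b, env)
    if op == "add" and isinstance(a, int) and isinstance(b, int):
        return a + b  # constant folding: 5 + 3 -> 8
    return (op, a, b)

def propagate(predicates):
    """Run constant propagation to a fixed point over ("eq", var, expr)."""
    env = {}  # var -> known constant, learned from "var = <const>" predicates
    changed = True
    while changed:
        changed = False
        out = []
        for op, var, expr in predicates:
            expr = fold(expr, env)
            if isinstance(expr, int) and env.get(var) != expr:
                env[var] = expr  # learned a new constant; iterate again
                changed = True
            out.append((op, var, expr))
        predicates = out
    return predicates

# WHERE i = 5 AND j = i + 3  ->  WHERE i = 5 AND j = 8
rewritten = propagate([("eq", "i", 5), ("eq", "j", ("add", "i", 3))])
```

The fixed-point loop matters because one substitution can enable another (e.g. `i = 5 AND j = i + 3 AND k = j + 1`).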
[jira] [Updated] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-22371:
--------------------------------
    Fix Version/s: 2.3.1

> dag-scheduler-event-loop thread stopped with error Attempted to access
> garbage collected accumulator 5605982
> ----------------------------------------------------------------------
>
>                 Key: SPARK-22371
>                 URL: https://issues.apache.org/jira/browse/SPARK-22371
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: Mayank Agarwal
>            Assignee: Artem Rudoy
>            Priority: Major
>             Fix For: 2.3.1, 2.4.0
>
>      Attachments: Helper.scala, ShuffleIssue.java,
> driver-thread-dump-spark2.1.txt, sampledata
>
> Our Spark jobs are getting stuck on DAGScheduler.runJob because the
> dag-scheduler thread is stopped with *Attempted to access garbage collected
> accumulator 5605982*.
> From our investigation, it looks like the accumulator is cleaned by GC
> first, and the same accumulator is later used for merging the results from
> the executor on a task completion event.
> As java.lang.IllegalAccessError is a LinkageError, which is treated as a
> fatal error, the dag-scheduler loop finishes with the exception below.
>
> ---ERROR stack trace --
> Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError:
> Attempted to access garbage collected accumulator 5605982
>         at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253)
>         at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249)
>         at scala.Option.map(Option.scala:146)
>         at org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>         at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080)
>         at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>         at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
> I am attaching the thread dump of the driver as well.
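The failure mode in SPARK-22371 (an accumulator collected by GC before a late task-completion event looks it up by id) can be demonstrated with Python's `weakref` standing in for Spark's AccumulatorContext, which likewise holds accumulators weakly. All names here are invented for the sketch; only the id 5605982 is taken from the report:

```python
import gc
import weakref

class Accumulator:
    """Toy stand-in for an AccumulatorV2 registered on the driver."""
    def __init__(self, acc_id):
        self.acc_id = acc_id
        self.value = 0

_registry = {}  # acc_id -> weakref.ref(Accumulator), as in AccumulatorContext

def register(acc):
    _registry[acc.acc_id] = weakref.ref(acc)

def get(acc_id):
    ref = _registry.get(acc_id)
    acc = ref() if ref is not None else None
    if acc is None:
        # Spark raises java.lang.IllegalAccessError("Attempted to access
        # garbage collected accumulator <id>") at this point.
        raise LookupError(f"garbage collected accumulator {acc_id}")
    return acc

acc = Accumulator(5605982)
register(acc)
assert get(5605982) is acc  # lookup works while a strong reference exists

del acc       # the driver drops its last strong reference...
gc.collect()  # ...GC clears it, but the registry entry (a dead weakref) remains
# A late task-completion event calling get(5605982) now raises LookupError.
```

The fix direction discussed in such cases is to make the lookup side tolerate a collected accumulator instead of letting the error kill the event loop.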
[jira] [Resolved] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering
[ https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-22884.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 21358
[https://github.com/apache/spark/pull/21358]

> ML test for StructuredStreaming: spark.ml.clustering
> ----------------------------------------------------
>
>                 Key: SPARK-22884
>                 URL: https://issues.apache.org/jira/browse/SPARK-22884
>             Project: Spark
>          Issue Type: Test
>          Components: ML, Tests
>    Affects Versions: 2.3.0
>            Reporter: Joseph K. Bradley
>            Assignee: Sandor Murakozi
>            Priority: Major
>             Fix For: 3.0.0
>
> Task for adding Structured Streaming tests for all Models/Transformers in a
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in
> https://github.com/apache/spark/pull/19843
[jira] [Assigned] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering
[ https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley reassigned SPARK-22884:
-----------------------------------------
    Assignee: Sandor Murakozi

> ML test for StructuredStreaming: spark.ml.clustering
> ----------------------------------------------------
>
>                 Key: SPARK-22884
>                 URL: https://issues.apache.org/jira/browse/SPARK-22884
>             Project: Spark
>          Issue Type: Test
>          Components: ML, Tests
>    Affects Versions: 2.3.0
>            Reporter: Joseph K. Bradley
>            Assignee: Sandor Murakozi
>            Priority: Major
>
> Task for adding Structured Streaming tests for all Models/Transformers in a
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in
> https://github.com/apache/spark/pull/19843
[jira] [Commented] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering
[ https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479814#comment-16479814 ]

Apache Spark commented on SPARK-22884:
--------------------------------------

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/21358

> ML test for StructuredStreaming: spark.ml.clustering
> ----------------------------------------------------
>
>                 Key: SPARK-22884
>                 URL: https://issues.apache.org/jira/browse/SPARK-22884
>             Project: Spark
>          Issue Type: Test
>          Components: ML, Tests
>    Affects Versions: 2.3.0
>            Reporter: Joseph K. Bradley
>            Priority: Major
>
> Task for adding Structured Streaming tests for all Models/Transformers in a
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in
> https://github.com/apache/spark/pull/19843
[jira] [Updated] (SPARK-24309) AsyncEventQueue should handle an interrupt from a Listener
[ https://issues.apache.org/jira/browse/SPARK-24309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Imran Rashid updated SPARK-24309:
---------------------------------
    Target Version/s: 2.3.1

> AsyncEventQueue should handle an interrupt from a Listener
> ----------------------------------------------------------
>
>                 Key: SPARK-24309
>                 URL: https://issues.apache.org/jira/browse/SPARK-24309
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 2.3.0
>            Reporter: Imran Rashid
>            Priority: Blocker
>
> AsyncEventQueue does not properly handle an interrupt from a Listener -- the
> spark app won't even stop!
> I observed this on an actual workload, as the EventLoggingListener can
> generate an interrupt from the underlying hdfs calls:
> {noformat}
> 18/05/16 17:46:36 WARN hdfs.DFSClient: Error transferring data from
> DatanodeInfoWithStorage[10.17.206.36:20002,DS-3adac910-5d0a-418b-b0f7-6332b35bf6a1,DISK]
> to DatanodeInfoWithStorage[10.17.206.42:20002,DS-2e7ed0aa-0e68-441e-b5b2-96ad4a9ce7a5,DISK]:
> 10 millis timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.17.206.35:33950
> remote=/10.17.206.36:20002]
> 18/05/16 17:46:36 WARN hdfs.DFSClient: DataStreamer Exception
> java.net.SocketTimeoutException: 10 millis timeout while waiting for
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
> local=/10.17.206.35:33950 remote=/10.17.206.36:20002]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>         at java.io.FilterInputStream.read(FilterInputStream.java:83)
>         at java.io.FilterInputStream.read(FilterInputStream.java:83)
>         at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2305)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$StreamerStreams.sendTransferBlock(DFSOutputStream.java:516)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1450)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1408)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1559)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1254)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:739)
> 18/05/16 17:46:36 ERROR scheduler.AsyncEventQueue: Listener
> EventLoggingListener threw an exception
> [... a few more of these ...]
> 18/05/16 17:46:36 INFO scheduler.AsyncEventQueue: Stopping listener queue
> eventLog.
> java.lang.InterruptedException
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
>         at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
>         at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:94)
>         at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>         at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:83)
>         at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:79)
>         at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
>         at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:78)
> {noformat}
> When this happens, the AsyncEventQueue will continue to pile up events in
> its queue, though it's no longer processing them. Then, in the call to stop,
> it'll block on {{queue.put(POISON_PILL)}} forever, so the SparkContext won't
> stop.
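The hang SPARK-24309 describes has a simple shape: the dispatch thread dies on an unexpected interrupt, nothing drains the bounded queue, and a blocking put of the poison pill in `stop()` waits forever. A Python sketch of that shape (stand-ins, not Spark's AsyncEventQueue; `put_nowait` is used at the end so the demo shows the full queue instead of actually hanging):

```python
import queue
import threading

POISON_PILL = object()
q = queue.Queue(maxsize=2)  # bounded, like AsyncEventQueue's event queue

def dispatch():
    """Consumer thread: normally drains events until it sees the pill."""
    while True:
        event = q.get()
        if event is POISON_PILL:
            return
        # Simulate a listener's InterruptedException escaping: the
        # dispatch thread exits without draining the rest of the queue.
        return

t = threading.Thread(target=dispatch)
t.start()
q.put("event-1")   # consumed, then the dispatcher dies
t.join()

q.put("event-2")   # events now pile up: nobody is taking them
q.put("event-3")   # queue is at capacity

try:
    q.put_nowait(POISON_PILL)  # a blocking put() here would hang forever
except queue.Full:
    stop_would_block = True    # this is the stuck-SparkContext scenario
```

The fix direction is for the queue to treat an interrupt from a listener as "remove that listener (or stop the queue) and keep the stop path unblocked," rather than letting the dispatch thread die silently.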
[jira] [Updated] (SPARK-24309) AsyncEventQueue should handle an interrupt from a Listener
[ https://issues.apache.org/jira/browse/SPARK-24309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Imran Rashid updated SPARK-24309:
---------------------------------
    Priority: Blocker  (was: Major)

> AsyncEventQueue should handle an interrupt from a Listener
> ----------------------------------------------------------
>
>                 Key: SPARK-24309
>                 URL: https://issues.apache.org/jira/browse/SPARK-24309
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 2.3.0
>            Reporter: Imran Rashid
>            Priority: Blocker
>
> AsyncEventQueue does not properly handle an interrupt from a Listener -- the
> spark app won't even stop!
> I observed this on an actual workload, as the EventLoggingListener can
> generate an interrupt from the underlying hdfs calls:
> {noformat}
> 18/05/16 17:46:36 WARN hdfs.DFSClient: Error transferring data from
> DatanodeInfoWithStorage[10.17.206.36:20002,DS-3adac910-5d0a-418b-b0f7-6332b35bf6a1,DISK]
> to DatanodeInfoWithStorage[10.17.206.42:20002,DS-2e7ed0aa-0e68-441e-b5b2-96ad4a9ce7a5,DISK]:
> 10 millis timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.17.206.35:33950
> remote=/10.17.206.36:20002]
> 18/05/16 17:46:36 WARN hdfs.DFSClient: DataStreamer Exception
> java.net.SocketTimeoutException: 10 millis timeout while waiting for
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
> local=/10.17.206.35:33950 remote=/10.17.206.36:20002]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>         at java.io.FilterInputStream.read(FilterInputStream.java:83)
>         at java.io.FilterInputStream.read(FilterInputStream.java:83)
>         at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2305)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$StreamerStreams.sendTransferBlock(DFSOutputStream.java:516)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1450)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1408)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1559)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1254)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:739)
> 18/05/16 17:46:36 ERROR scheduler.AsyncEventQueue: Listener
> EventLoggingListener threw an exception
> [... a few more of these ...]
> 18/05/16 17:46:36 INFO scheduler.AsyncEventQueue: Stopping listener queue
> eventLog.
> java.lang.InterruptedException
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
>         at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
>         at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:94)
>         at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>         at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:83)
>         at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:79)
>         at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
>         at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:78)
> {noformat}
> When this happens, the AsyncEventQueue will continue to pile up events in
> its queue, though it's no longer processing them. Then, in the call to stop,
> it'll block on {{queue.put(POISON_PILL)}} forever, so the SparkContext won't
> stop.
[jira] [Updated] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode
[ https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin updated SPARK-21945:
-----------------------------------
    Fix Version/s: 2.3.1

> pyspark --py-files doesn't work in yarn client mode
> ---------------------------------------------------
>
>                 Key: SPARK-21945
>                 URL: https://issues.apache.org/jira/browse/SPARK-21945
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Thomas Graves
>            Assignee: Hyukjin Kwon
>            Priority: Major
>             Fix For: 2.3.1, 2.4.0
>
>
> I tried running pyspark with --py-files pythonfiles.zip but it doesn't
> properly add the zip file to the PYTHONPATH.
> I can work around by exporting PYTHONPATH.
> Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand I don't see
> this supported at all. If that is the case perhaps it should be moved to
> improvement.
> Note it works via spark-submit in both client and cluster mode to run python
> script.
[jira] [Commented] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode
[ https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479765#comment-16479765 ]

Marcelo Vanzin commented on SPARK-21945:
----------------------------------------

This works for cluster mode and pyspark (shell), but not for spark-submit in
client mode. Still, this is an improvement, so I'll cherry-pick it. It would
be nice to open a separate bug to close the remaining gap.

> pyspark --py-files doesn't work in yarn client mode
> ---------------------------------------------------
>
>                 Key: SPARK-21945
>                 URL: https://issues.apache.org/jira/browse/SPARK-21945
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Thomas Graves
>            Assignee: Hyukjin Kwon
>            Priority: Major
>             Fix For: 2.4.0
>
>
> I tried running pyspark with --py-files pythonfiles.zip but it doesn't
> properly add the zip file to the PYTHONPATH.
> I can work around by exporting PYTHONPATH.
> Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand I don't see
> this supported at all. If that is the case perhaps it should be moved to
> improvement.
> Note it works via spark-submit in both client and cluster mode to run python
> script.
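The workaround mentioned in SPARK-21945 (exporting PYTHONPATH with the zip on it by hand) works because CPython's zipimport machinery accepts zip archives as `sys.path` entries, which is exactly what `--py-files` is supposed to arrange. A self-contained sketch using a throwaway archive; the module name and contents are invented for illustration:

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a small zip of Python files, like the pythonfiles.zip in the report.
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "pythonfiles.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("py_files_demo.py", "ANSWER = 42\n")

# Putting the archive on sys.path is what PYTHONPATH=/path/pythonfiles.zip
# (the manual workaround) achieves: modules import straight from the zip.
sys.path.insert(0, zip_path)
py_files_demo = importlib.import_module("py_files_demo")
```

The bug was that the pyspark shell in yarn client mode never performed this path setup for `--py-files`, so users had to do it themselves via the environment variable.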
[jira] [Assigned] (SPARK-24311) Refactor HDFSBackedStateStoreProvider to remove duplicated logic between operations on delta file and snapshot file
[ https://issues.apache.org/jira/browse/SPARK-24311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24311:
------------------------------------
    Assignee: Apache Spark

> Refactor HDFSBackedStateStoreProvider to remove duplicated logic between
> operations on delta file and snapshot file
> ------------------------------------------------------------------------
>
>                 Key: SPARK-24311
>                 URL: https://issues.apache.org/jira/browse/SPARK-24311
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Jungtaek Lim
>            Assignee: Apache Spark
>            Priority: Major
>
> The structure of delta file and snapshot file is same, but the operations
> are defined as individual methods which incurs duplicated logic between
> logic for delta file and logic for snapshot file.
> We can refactor to remove duplicated logic to ensure readability, as well
> as guaranteeing that the structure of delta file and snapshot file keeps
> same.
[jira] [Commented] (SPARK-24311) Refactor HDFSBackedStateStoreProvider to remove duplicated logic between operations on delta file and snapshot file
[ https://issues.apache.org/jira/browse/SPARK-24311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479728#comment-16479728 ]

Apache Spark commented on SPARK-24311:
--------------------------------------

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/21357

> Refactor HDFSBackedStateStoreProvider to remove duplicated logic between
> operations on delta file and snapshot file
> ------------------------------------------------------------------------
>
>                 Key: SPARK-24311
>                 URL: https://issues.apache.org/jira/browse/SPARK-24311
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> The structure of delta file and snapshot file is same, but the operations
> are defined as individual methods which incurs duplicated logic between
> logic for delta file and logic for snapshot file.
> We can refactor to remove duplicated logic to ensure readability, as well
> as guaranteeing that the structure of delta file and snapshot file keeps
> same.
[jira] [Updated] (SPARK-24311) Refactor HDFSBackedStateStoreProvider to remove duplicated logic between operations on delta file and snapshot file
[ https://issues.apache.org/jira/browse/SPARK-24311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim updated SPARK-24311:
---------------------------------
    Description: 
The structure of delta file and snapshot file is same, but the operations
are defined as individual methods which incurs duplicated logic between
delta file and snapshot file.

We can refactor to remove duplicated logic to ensure readability, as well
as guaranteeing that the structure of delta file and snapshot file keeps
same.

  was:
The structure of delta file and snapshot file is same, but the operations
are defined as individual methods which incurs duplicated logic between
logic for delta file and logic for snapshot file.

We can refactor to remove duplicated logic to ensure readability, as well
as guaranteeing that the structure of delta file and snapshot file keeps
same.

> Refactor HDFSBackedStateStoreProvider to remove duplicated logic between
> operations on delta file and snapshot file
> ------------------------------------------------------------------------
>
>                 Key: SPARK-24311
>                 URL: https://issues.apache.org/jira/browse/SPARK-24311
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> The structure of delta file and snapshot file is same, but the operations
> are defined as individual methods which incurs duplicated logic between
> delta file and snapshot file.
> We can refactor to remove duplicated logic to ensure readability, as well
> as guaranteeing that the structure of delta file and snapshot file keeps
> same.
[jira] [Assigned] (SPARK-24311) Refactor HDFSBackedStateStoreProvider to remove duplicated logic between operations on delta file and snapshot file
[ https://issues.apache.org/jira/browse/SPARK-24311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24311:
------------------------------------
    Assignee: (was: Apache Spark)

> Refactor HDFSBackedStateStoreProvider to remove duplicated logic between
> operations on delta file and snapshot file
> ------------------------------------------------------------------------
>
>                 Key: SPARK-24311
>                 URL: https://issues.apache.org/jira/browse/SPARK-24311
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> The structure of delta file and snapshot file is same, but the operations
> are defined as individual methods which incurs duplicated logic between
> logic for delta file and logic for snapshot file.
> We can refactor to remove duplicated logic to ensure readability, as well
> as guaranteeing that the structure of delta file and snapshot file keeps
> same.
[jira] [Created] (SPARK-24311) Refactor HDFSBackedStateStoreProvide to remove duplicated logic between operations on delta file and operations on snapshot file
Jungtaek Lim created SPARK-24311:
------------------------------------

             Summary: Refactor HDFSBackedStateStoreProvide to remove
duplicated logic between operations on delta file and operations on snapshot
file
                 Key: SPARK-24311
                 URL: https://issues.apache.org/jira/browse/SPARK-24311
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 2.3.0
            Reporter: Jungtaek Lim

The structure of delta file and snapshot file is same, but the operations
are defined as individual methods which incurs duplicated logic between
logic for delta file and logic for snapshot file.

We can refactor to remove duplicated logic to ensure readability, as well
as guaranteeing that the structure of delta file and snapshot file keeps
same.
[jira] [Updated] (SPARK-24311) Refactor HDFSBackedStateStoreProvider to remove duplicated logic between operations on delta file and snapshot file
[ https://issues.apache.org/jira/browse/SPARK-24311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim updated SPARK-24311:
---------------------------------
    Summary: Refactor HDFSBackedStateStoreProvider to remove duplicated
logic between operations on delta file and snapshot file  (was: Refactor
HDFSBackedStateStoreProvide to remove duplicated logic between operations on
delta file and snapshot file)

> Refactor HDFSBackedStateStoreProvider to remove duplicated logic between
> operations on delta file and snapshot file
> ------------------------------------------------------------------------
>
>                 Key: SPARK-24311
>                 URL: https://issues.apache.org/jira/browse/SPARK-24311
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> The structure of delta file and snapshot file is same, but the operations
> are defined as individual methods which incurs duplicated logic between
> logic for delta file and logic for snapshot file.
> We can refactor to remove duplicated logic to ensure readability, as well
> as guaranteeing that the structure of delta file and snapshot file keeps
> same.
[jira] [Updated] (SPARK-24311) Refactor HDFSBackedStateStoreProvide to remove duplicated logic between operations on delta file and snapshot file
[ https://issues.apache.org/jira/browse/SPARK-24311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim updated SPARK-24311:
---------------------------------
    Summary: Refactor HDFSBackedStateStoreProvide to remove duplicated logic
between operations on delta file and snapshot file  (was: Refactor
HDFSBackedStateStoreProvide to remove duplicated logic between operations on
delta file and operations on snapshot file)

> Refactor HDFSBackedStateStoreProvide to remove duplicated logic between
> operations on delta file and snapshot file
> -----------------------------------------------------------------------
>
>                 Key: SPARK-24311
>                 URL: https://issues.apache.org/jira/browse/SPARK-24311
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> The structure of delta file and snapshot file is same, but the operations
> are defined as individual methods which incurs duplicated logic between
> logic for delta file and logic for snapshot file.
> We can refactor to remove duplicated logic to ensure readability, as well
> as guaranteeing that the structure of delta file and snapshot file keeps
> same.
[jira] [Assigned] (SPARK-24309) AsyncEventQueue should handle an interrupt from a Listener
[ https://issues.apache.org/jira/browse/SPARK-24309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24309: Assignee: (was: Apache Spark) > AsyncEventQueue should handle an interrupt from a Listener > -- > > Key: SPARK-24309 > URL: https://issues.apache.org/jira/browse/SPARK-24309 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Major > > AsyncEventQueue does not properly handle an interrupt from a Listener -- the > spark app won't even stop! > I observed this on an actual workload as the EventLoggingListener can > generate an interrupt from the underlying hdfs calls: > {noformat} > 18/05/16 17:46:36 WARN hdfs.DFSClient: Error transferring data from > DatanodeInfoWithStorage[10.17.206.36:20002,DS-3adac910-5d0a-418b-b0f7-6332b35bf6a1,DISK] > to > DatanodeInfoWithStorage[10.17.206.42:20002,DS-2e7ed0aa-0e68-441e-b5b2-96ad4a9ce7a5,DISK]: > 10 millis timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/10.17.206.35:33950 > remote=/10.17.206.36:20002] > 18/05/16 17:46:36 WARN hdfs.DFSClient: DataStreamer Exception > java.net.SocketTimeoutException: 10 millis timeout while waiting for > channel to be ready for read. 
ch : java.nio.channels.SocketChannel[connected > local=/10.17.206.35:33950 remote=/10.17.206.36:20002] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) > at java.io.FilterInputStream.read(FilterInputStream.java:83) > at java.io.FilterInputStream.read(FilterInputStream.java:83) > at > org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2305) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$StreamerStreams.sendTransferBlock(DFSOutputStream.java:516) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1450) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1408) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1559) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1254) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:739) > 18/05/16 17:46:36 ERROR scheduler.AsyncEventQueue: Listener > EventLoggingListener threw an exception > [... a few more of these ...] > 18/05/16 17:46:36 INFO scheduler.AsyncEventQueue: Stopping listener queue > eventLog. 
> java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439) > at > org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:94) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:83) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:79) > at > org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:78) > {noformat} > When this happens, the AsyncEventQueue will continue to pile up events in its > queue, though it's no longer processing them. And then in the call to stop, > it'll block on {{queue.put(POISON_PILL)}} forever, so the SparkContext won't > stop. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24309) AsyncEventQueue should handle an interrupt from a Listener
[ https://issues.apache.org/jira/browse/SPARK-24309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24309: Assignee: Apache Spark > AsyncEventQueue should handle an interrupt from a Listener > -- > > Key: SPARK-24309 > URL: https://issues.apache.org/jira/browse/SPARK-24309 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24309) AsyncEventQueue should handle an interrupt from a Listener
[ https://issues.apache.org/jira/browse/SPARK-24309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479688#comment-16479688 ] Apache Spark commented on SPARK-24309: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/21356 > AsyncEventQueue should handle an interrupt from a Listener > -- > > Key: SPARK-24309 > URL: https://issues.apache.org/jira/browse/SPARK-24309 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24114) improve instrumentation for spark.ml.recommendation
[ https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479686#comment-16479686 ] Apache Spark commented on SPARK-24114: -- User 'MrBago' has created a pull request for this issue: https://github.com/apache/spark/pull/21344 > improve instrumentation for spark.ml.recommendation > --- > > Key: SPARK-24114 > URL: https://issues.apache.org/jira/browse/SPARK-24114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-24114) improve instrumentation for spark.ml.recommendation
[ https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-24114: -- Comment: was deleted (was: User 'MrBago' has created a pull request for this issue: https://github.com/apache/spark/pull/21344) > improve instrumentation for spark.ml.recommendation > --- > > Key: SPARK-24114 > URL: https://issues.apache.org/jira/browse/SPARK-24114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24310) Instrumentation for frequent pattern mining
[ https://issues.apache.org/jira/browse/SPARK-24310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-24310. --- Resolution: Fixed Fix Version/s: 2.4.0 > Instrumentation for frequent pattern mining > --- > > Key: SPARK-24310 > URL: https://issues.apache.org/jira/browse/SPARK-24310 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Bago Amirbekian >Priority: Major > Fix For: 2.4.0 > > > See parent JIRA -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24310) Instrumentation for frequent pattern mining
[ https://issues.apache.org/jira/browse/SPARK-24310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479684#comment-16479684 ] Joseph K. Bradley commented on SPARK-24310: --- The PR for this was linked to the wrong JIRA, but I'm adding the link here for the record. > Instrumentation for frequent pattern mining > --- > > Key: SPARK-24310 > URL: https://issues.apache.org/jira/browse/SPARK-24310 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Bago Amirbekian >Priority: Major > Fix For: 2.4.0 > > > See parent JIRA -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24310) Instrumentation for frequent pattern mining
Joseph K. Bradley created SPARK-24310: - Summary: Instrumentation for frequent pattern mining Key: SPARK-24310 URL: https://issues.apache.org/jira/browse/SPARK-24310 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.3.0 Reporter: Joseph K. Bradley Assignee: Bago Amirbekian See parent JIRA -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24114) improve instrumentation for spark.ml.recommendation
[ https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-24114: - Assignee: (was: Bago Amirbekian) > improve instrumentation for spark.ml.recommendation > --- > > Key: SPARK-24114 > URL: https://issues.apache.org/jira/browse/SPARK-24114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24114) improve instrumentation for spark.ml.recommendation
[ https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-24114: -- Shepherd: (was: Joseph K. Bradley) > improve instrumentation for spark.ml.recommendation > --- > > Key: SPARK-24114 > URL: https://issues.apache.org/jira/browse/SPARK-24114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24309) AsyncEventQueue should handle an interrupt from a Listener
Imran Rashid created SPARK-24309: Summary: AsyncEventQueue should handle an interrupt from a Listener Key: SPARK-24309 URL: https://issues.apache.org/jira/browse/SPARK-24309 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 2.3.0 Reporter: Imran Rashid AsyncEventQueue does not properly handle an interrupt from a Listener -- the spark app won't even stop! I observed this on an actual workload as the EventLoggingListener can generate an interrupt from the underlying hdfs calls: {noformat} 18/05/16 17:46:36 WARN hdfs.DFSClient: Error transferring data from DatanodeInfoWithStorage[10.17.206.36:20002,DS-3adac910-5d0a-418b-b0f7-6332b35bf6a1,DISK] to DatanodeInfoWithStorage[10.17.206.42:20002,DS-2e7ed0aa-0e68-441e-b5b2-96ad4a9ce7a5,DISK]: 10 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.17.206.35:33950 remote=/10.17.206.36:20002] 18/05/16 17:46:36 WARN hdfs.DFSClient: DataStreamer Exception java.net.SocketTimeoutException: 10 millis timeout while waiting for channel to be ready for read. 
ch : java.nio.channels.SocketChannel[connected local=/10.17.206.35:33950 remote=/10.17.206.36:20002] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) at java.io.FilterInputStream.read(FilterInputStream.java:83) at java.io.FilterInputStream.read(FilterInputStream.java:83) at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2305) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$StreamerStreams.sendTransferBlock(DFSOutputStream.java:516) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1450) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1408) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1559) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1254) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:739) 18/05/16 17:46:36 ERROR scheduler.AsyncEventQueue: Listener EventLoggingListener threw an exception [... a few more of these ...] 18/05/16 17:46:36 INFO scheduler.AsyncEventQueue: Stopping listener queue eventLog. 
java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220) at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439) at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:94) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:83) at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:79) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319) at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:78) {noformat} When this happens, the AsyncEventQueue will continue to pile up events in its queue, though it's no longer processing them. And then in the call to stop, it'll block on {{queue.put(POISON_PILL)}} forever, so the SparkContext won't stop. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
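The failure mode in the last paragraph — a blocking {{queue.put(POISON_PILL)}} that can never complete once the dispatch thread has died — can be illustrated with a small sketch. The class and method names here are hypothetical, and the timeout-based offer is shown as one possible mitigation, not necessarily the fix Spark adopted:

```java
import java.util.concurrent.*;

// Minimal illustration (not Spark's code) of the hang described above:
// once the dispatch thread dies from the InterruptedException, nothing
// drains the bounded queue. A plain queue.put("POISON_PILL") then blocks
// forever when the queue is full; a bounded wait via offer(...) lets the
// caller detect the stall instead of hanging SparkContext.stop().
final class StalledQueueDemo {
    static boolean tryStop(BlockingQueue<String> queue, long timeoutMs)
            throws InterruptedException {
        // Returns false if the pill could not be enqueued within the
        // timeout, i.e. the consumer is gone and the queue stays full.
        return queue.offer("POISON_PILL", timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```

With a dead consumer and a full queue, {{tryStop}} returns false after the timeout, whereas the blocking {{put}} in the report never returns at all.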
[jira] [Updated] (SPARK-24114) improve instrumentation for spark.ml.recommendation
[ https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-24114: -- Shepherd: Joseph K. Bradley > improve instrumentation for spark.ml.recommendation > --- > > Key: SPARK-24114 > URL: https://issues.apache.org/jira/browse/SPARK-24114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Assignee: Bago Amirbekian >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24114) improve instrumentation for spark.ml.recommendation
[ https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-24114: - Assignee: Bago Amirbekian > improve instrumentation for spark.ml.recommendation > --- > > Key: SPARK-24114 > URL: https://issues.apache.org/jira/browse/SPARK-24114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Assignee: Bago Amirbekian >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6238) Support shuffle where individual blocks might be > 2G
[ https://issues.apache.org/jira/browse/SPARK-6238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479645#comment-16479645 ] William Shen commented on SPARK-6238: - [~irashid], you marked this as a duplicate. Can you also mark which ticket this duplicates? I see that you marked this as duplicated by SPARK-5928. Did you mean that this duplicates SPARK-5928? (Or does this duplicate SPARK-19659, per your comment?) Thanks in advance for the clarification! > Support shuffle where individual blocks might be > 2G > - > > Key: SPARK-6238 > URL: https://issues.apache.org/jira/browse/SPARK-6238 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Assignee: jin xing >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode
[ https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479610#comment-16479610 ] Marcelo Vanzin commented on SPARK-21945: Let me run some tests and if they're ok, I'll cherry pick. > pyspark --py-files doesn't work in yarn client mode > --- > > Key: SPARK-21945 > URL: https://issues.apache.org/jira/browse/SPARK-21945 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Thomas Graves >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.0 > > > I tried running pyspark with --py-files pythonfiles.zip but it doesn't > properly add the zip file to the PYTHONPATH. > I can work around by exporting PYTHONPATH. > Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand I don't see > this supported at all. If that is the case perhaps it should be moved to > improvement. > Note it works via spark-submit in both client and cluster mode to run python > script. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode
[ https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479574#comment-16479574 ] Dongjoon Hyun commented on SPARK-21945: --- Hi, [~vanzin], [~tgraves], [~hyukjin.kwon]. I'm wondering if we can have this at 2.3.1 RC2, too? Or, at least, `branch-2.3` sometime? > pyspark --py-files doesn't work in yarn client mode > --- > > Key: SPARK-21945 > URL: https://issues.apache.org/jira/browse/SPARK-21945 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Thomas Graves >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.0 > > > I tried running pyspark with --py-files pythonfiles.zip but it doesn't > properly add the zip file to the PYTHONPATH. > I can work around by exporting PYTHONPATH. > Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand I don't see > this supported at all. If that is the case perhaps it should be moved to > improvement. > Note it works via spark-submit in both client and cluster mode to run python > script. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23195) Hint of cached data is lost
[ https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23195. - Resolution: Won't Fix > Hint of cached data is lost > --- > > Key: SPARK-23195 > URL: https://issues.apache.org/jira/browse/SPARK-23195 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Xiao Li >Priority: Major > > {noformat} > withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { > val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", > "value") > val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", > "value") > broadcast(df2).cache() > df2.collect() > val df3 = df1.join(df2, Seq("key"), "inner") > val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect { > case b: BroadcastHashJoinExec => b > }.size > assert(numBroadCastHashJoin === 1) > } > {noformat} > The broadcast hint is not respected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23571) Delete auxiliary Kubernetes resources upon application completion
[ https://issues.apache.org/jira/browse/SPARK-23571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-23571: --- Target Version/s: (was: 2.3.1) > Delete auxiliary Kubernetes resources upon application completion > - > > Key: SPARK-23571 > URL: https://issues.apache.org/jira/browse/SPARK-23571 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1 >Reporter: Yinan Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23571) Delete auxiliary Kubernetes resources upon application completion
[ https://issues.apache.org/jira/browse/SPARK-23571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479541#comment-16479541 ] Marcelo Vanzin commented on SPARK-23571: I removed the target version since the PR seems stalled (and doesn't even merge anymore). > Delete auxiliary Kubernetes resources upon application completion > - > > Key: SPARK-23571 > URL: https://issues.apache.org/jira/browse/SPARK-23571 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1 >Reporter: Yinan Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23486) LookupFunctions should not check the same function name more than once
[ https://issues.apache.org/jira/browse/SPARK-23486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479539#comment-16479539 ] Marcelo Vanzin commented on SPARK-23486: I removed the target versions since the PR seems stalled in review. > LookupFunctions should not check the same function name more than once > -- > > Key: SPARK-23486 > URL: https://issues.apache.org/jira/browse/SPARK-23486 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Cheng Lian >Priority: Major > Labels: starter > > For a query invoking the same function multiple times, the current > {{LookupFunctions}} rule performs a check for each invocation. For users > using Hive metastore as external catalog, this issues unnecessary metastore > accesses and can slow down the analysis phase quite a bit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
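The optimization this ticket asks for — checking each distinct function name at most once per query instead of once per invocation — amounts to memoizing the catalog lookup. A minimal sketch with hypothetical names (not Spark's actual LookupFunctions rule or catalog API):

```java
import java.util.*;
import java.util.function.Predicate;

// Sketch of the requested behavior: remember which function names have
// already been verified during one analysis pass, so repeated invocations
// of the same function in a query trigger only one external-catalog
// (e.g. Hive metastore) access.
final class FunctionLookupCache {
    private final Predicate<String> catalogCheck;   // stands in for the expensive metastore call
    private final Set<String> verified = new HashSet<>();
    int catalogHits = 0;                            // counter, for illustration only

    FunctionLookupCache(Predicate<String> catalogCheck) {
        this.catalogCheck = catalogCheck;
    }

    boolean exists(String name) {
        if (verified.contains(name)) return true;   // already checked this pass
        catalogHits++;
        boolean ok = catalogCheck.test(name);
        if (ok) verified.add(name);                 // cache only positive results
        return ok;
    }
}
```

A query invoking the same function N times then costs one metastore round trip instead of N, which is where the analysis-phase slowdown described above comes from.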
[jira] [Updated] (SPARK-23486) LookupFunctions should not check the same function name more than once
[ https://issues.apache.org/jira/browse/SPARK-23486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-23486: --- Target Version/s: (was: 2.3.1, 2.4.0) > LookupFunctions should not check the same function name more than once > -- > > Key: SPARK-23486 > URL: https://issues.apache.org/jira/browse/SPARK-23486 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Cheng Lian >Priority: Major > Labels: starter > > For a query invoking the same function multiple times, the current > {{LookupFunctions}} rule performs a check for each invocation. For users > using Hive metastore as external catalog, this issues unnecessary metastore > accesses and can slow down the analysis phase quite a bit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23195) Hint of cached data is lost
[ https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479535#comment-16479535 ] Marcelo Vanzin commented on SPARK-23195: [~smilegator] doesn't look like you're actively working on this so I changed the bug's state and removed the target version. > Hint of cached data is lost > --- > > Key: SPARK-23195 > URL: https://issues.apache.org/jira/browse/SPARK-23195 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Xiao Li >Priority: Major > > {noformat} > withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { > val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", > "value") > val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", > "value") > broadcast(df2).cache() > df2.collect() > val df3 = df1.join(df2, Seq("key"), "inner") > val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect { > case b: BroadcastHashJoinExec => b > }.size > assert(numBroadCastHashJoin === 1) > } > {noformat} > The broadcast hint is not respected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23195) Hint of cached data is lost
[ https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-23195: --- Target Version/s: (was: 2.3.1) > Hint of cached data is lost > --- > > Key: SPARK-23195 > URL: https://issues.apache.org/jira/browse/SPARK-23195 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Xiao Li >Assignee: Marcelo Vanzin >Priority: Major > > {noformat} > withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { > val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", > "value") > val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", > "value") > broadcast(df2).cache() > df2.collect() > val df3 = df1.join(df2, Seq("key"), "inner") > val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect { > case b: BroadcastHashJoinExec => b > }.size > assert(numBroadCastHashJoin === 1) > } > {noformat} > The broadcast hint is not respected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23195) Hint of cached data is lost
[ https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23195: -- Assignee: Marcelo Vanzin (was: Xiao Li) > Hint of cached data is lost > --- > > Key: SPARK-23195 > URL: https://issues.apache.org/jira/browse/SPARK-23195 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Xiao Li >Assignee: Marcelo Vanzin >Priority: Major > > {noformat} > withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { > val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", > "value") > val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", > "value") > broadcast(df2).cache() > df2.collect() > val df3 = df1.join(df2, Seq("key"), "inner") > val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect { > case b: BroadcastHashJoinExec => b > }.size > assert(numBroadCastHashJoin === 1) > } > {noformat} > The broadcast hint is not respected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23195) Hint of cached data is lost
[ https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23195: -- Assignee: (was: Marcelo Vanzin) > Hint of cached data is lost > --- > > Key: SPARK-23195 > URL: https://issues.apache.org/jira/browse/SPARK-23195 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Xiao Li >Priority: Major > > {noformat} > withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { > val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", > "value") > val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", > "value") > broadcast(df2).cache() > df2.collect() > val df3 = df1.join(df2, Seq("key"), "inner") > val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect { > case b: BroadcastHashJoinExec => b > }.size > assert(numBroadCastHashJoin === 1) > } > {noformat} > The broadcast hint is not respected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24115) improve instrumentation for spark.ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-24115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-24115. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 21340 [https://github.com/apache/spark/pull/21340] > improve instrumentation for spark.ml.tuning > --- > > Key: SPARK-24115 > URL: https://issues.apache.org/jira/browse/SPARK-24115 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24115) improve instrumentation for spark.ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-24115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-24115: - Assignee: Bago Amirbekian > improve instrumentation for spark.ml.tuning > --- > > Key: SPARK-24115 > URL: https://issues.apache.org/jira/browse/SPARK-24115 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Assignee: Bago Amirbekian >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24308) Handle DataReaderFactory to InputPartition renames in left over classes
[ https://issues.apache.org/jira/browse/SPARK-24308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24308: Assignee: (was: Apache Spark) > Handle DataReaderFactory to InputPartition renames in left over classes > --- > > Key: SPARK-24308 > URL: https://issues.apache.org/jira/browse/SPARK-24308 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Arun Mahadevan >Priority: Major > > SPARK-24073 renames DataReaderFactory -> InputPartition and DataReader -> > InputPartitionReader. Some classes still reflect the old name and cause > confusion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24308) Handle DataReaderFactory to InputPartition renames in left over classes
[ https://issues.apache.org/jira/browse/SPARK-24308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24308: Assignee: Apache Spark > Handle DataReaderFactory to InputPartition renames in left over classes > --- > > Key: SPARK-24308 > URL: https://issues.apache.org/jira/browse/SPARK-24308 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Arun Mahadevan >Assignee: Apache Spark >Priority: Major > > SPARK-24073 renames DataReaderFactory -> InputPartition and DataReader -> > InputPartitionReader. Some classes still reflect the old name and cause > confusion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24308) Handle DataReaderFactory to InputPartition renames in left over classes
[ https://issues.apache.org/jira/browse/SPARK-24308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479485#comment-16479485 ] Apache Spark commented on SPARK-24308: -- User 'arunmahadevan' has created a pull request for this issue: https://github.com/apache/spark/pull/21355 > Handle DataReaderFactory to InputPartition renames in left over classes > --- > > Key: SPARK-24308 > URL: https://issues.apache.org/jira/browse/SPARK-24308 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Arun Mahadevan >Priority: Major > > SPARK-24073 renames DataReaderFactory -> InputPartition and DataReader -> > InputPartitionReader. Some classes still reflect the old name and cause > confusion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24308) Handle DataReaderFactory to InputPartition renames in left over classes
Arun Mahadevan created SPARK-24308: -- Summary: Handle DataReaderFactory to InputPartition renames in left over classes Key: SPARK-24308 URL: https://issues.apache.org/jira/browse/SPARK-24308 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Arun Mahadevan SPARK-24073 renames DataReaderFactory -> InputPartition and DataReader -> InputPartitionReader. Some classes still reflect the old name and cause confusion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24307) Support sending messages over 2GB from memory
[ https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479439#comment-16479439 ] Imran Rashid commented on SPARK-24307: -- I have a really hacky version of this now, I plan to clean up and post a pr within a few days. > Support sending messages over 2GB from memory > - > > Key: SPARK-24307 > URL: https://issues.apache.org/jira/browse/SPARK-24307 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Spark Core >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Major > > Spark's networking layer supports sending messages backed by a {{FileRegion}} > or a {{ByteBuf}}. Sending large FileRegion's works, as netty supports large > FileRegions. However, {{ByteBuf}} is limited to 2GB. This is particularly > a problem for sending large datasets that are already in memory, eg. cached > RDD blocks. > eg. if you try to replicate a block stored in memory that is over 2 GB, you > will see an exception like: > {noformat} > 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC > 7420542363232096629 to xyz.com/172.31.113.213:44358: > io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: > readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= > writerIndex <= capacity(-1294617291)) > io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: > readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= > writerIndex <= capacity(-1294617291)) > at > io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730) > at > io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816) > at > 
io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723) > at > io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730) > at > io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081) > at > io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070) > at > io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463) > at > io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: > -1294617291 (expected: 0 <= readerIndex <= writerIndex <= > capacity(-1294617291)) > at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129) > at > io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688) > at io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:110) > at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359) > at > org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87) > at > org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95) > at > 
org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52) > at > org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58) > at > org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33) > at > io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88) > ... 17 more > {noformat} > A simple solution to this is to create a "FileRegion" which is backed by a > {{ChunkedByteBuffer}} (spark's existing datastructure to support blocks > 2GB > in memory). > A drawback to this approach is that
[jira] [Commented] (SPARK-24283) Make standard scaler work without legacy MLlib
[ https://issues.apache.org/jira/browse/SPARK-24283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479405#comment-16479405 ] Huaxin Gao commented on SPARK-24283: [~holdenk] Hi Holden, all I need to do is to change {code:java} val transformer: Vector => Vector = v => scaler.transform(OldVectors.fromML(v)).asML{code} to {code:java} val transformer: Vector => Vector = v => scaler.transform(v).asML{code} so it will use the implicit conversion, right? > Make standard scaler work without legacy MLlib > -- > > Key: SPARK-24283 > URL: https://issues.apache.org/jira/browse/SPARK-24283 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Trivial > Labels: starter > > Currently StandardScaler converts Spark ML vectors to MLlib vectors during > prediction; we should skip that step. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
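The mechanism being asked about in the comment above can be sketched outside Spark. This toy example uses made-up MLVector/OldVector types and a made-up scaleOld function, not Spark's real classes; it only shows how an in-scope implicit conversion plays the role the explicit fromML call played:

```scala
import scala.language.implicitConversions

// Plain-Scala sketch of the mechanism under discussion; MLVector, OldVector,
// and scaleOld are made-up stand-ins, not Spark's real classes.
case class MLVector(values: Array[Double])
case class OldVector(values: Array[Double]) {
  def asML: MLVector = MLVector(values)
}

// With this conversion in scope, an MLVector can be passed where an
// OldVector is expected, which is the role the explicit fromML call played.
implicit def mlToOld(v: MLVector): OldVector = OldVector(v.values)

def scaleOld(v: OldVector): OldVector = OldVector(v.values.map(_ * 2.0))

// The compiler inserts mlToOld around the argument automatically:
val out: MLVector = scaleOld(MLVector(Array(1.0, 2.0))).asML
```

Whether the real call site compiles this way depends on Spark having such a conversion in scope, which is exactly the question posed in the comment.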
[jira] [Created] (SPARK-24307) Support sending messages over 2GB from memory
Imran Rashid created SPARK-24307: Summary: Support sending messages over 2GB from memory Key: SPARK-24307 URL: https://issues.apache.org/jira/browse/SPARK-24307 Project: Spark Issue Type: Sub-task Components: Block Manager, Spark Core Affects Versions: 2.3.0 Reporter: Imran Rashid Spark's networking layer supports sending messages backed by a {{FileRegion}} or a {{ByteBuf}}. Sending large FileRegion's works, as netty supports large FileRegions. However, {{ByteBuf}} is limited to 2GB. This is particularly a problem for sending large datasets that are already in memory, eg. cached RDD blocks. eg. if you try to replicate a block stored in memory that is over 2 GB, you will see an exception like: {noformat} 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC 7420542363232096629 to xyz.com/172.31.113.213:44358: io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= writerIndex <= capacity(-1294617291)) io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= writerIndex <= capacity(-1294617291)) at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106) at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730) at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816) at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723) at io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302) at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730) at 
io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38) at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081) at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128) at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070) at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= writerIndex <= capacity(-1294617291)) at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129) at io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688) at io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:110) at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359) at org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87) at org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95) at org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52) at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58) at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33) at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88) ... 
17 more {noformat} A simple solution to this is to create a "FileRegion" which is backed by a {{ChunkedByteBuffer}} (Spark's existing data structure to support blocks > 2GB in memory). A drawback to this approach is that blocks that are cached in memory as deserialized values would need to have the *entire* block serialized into memory before they can be pushed. However, that would involve a larger change to the block manager as well, and is not strictly necessary, so can be handled separately as a performance improvement. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail:
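A side note on the stack trace above: the negative writerIndex is the signature of a 32-bit overflow, since netty's ByteBuf tracks its indices as Int. A quick plain-Scala sketch of the arithmetic (illustrative only, not netty or Spark code):

```scala
// Illustrative arithmetic only, not netty or Spark code: netty's ByteBuf
// tracks indices as 32-bit Ints, so a buffer just over 2 GB wraps to a
// negative size, which is the "writerIndex: -1294617291" in the log above.
val blockSizeBytes: Long = 3000350005L        // ~2.8 GB, over Int.MaxValue
val wrappedIndex: Int = blockSizeBytes.toInt  // what a ByteBuf would record
// wrappedIndex is now -1294617291, matching the error message
```

This is why the fix has to route large in-memory blocks through a FileRegion-style abstraction rather than a single ByteBuf.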
[jira] [Updated] (SPARK-24285) Flaky test: ContinuousSuite.query without test harness
[ https://issues.apache.org/jira/browse/SPARK-24285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24285: -- Description: - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/373/ was:- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/ > Flaky test: ContinuousSuite.query without test harness > -- > > Key: SPARK-24285 > URL: https://issues.apache.org/jira/browse/SPARK-24285 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/373/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24304) Scheduler changes for continuous processing shuffle support
[ https://issues.apache.org/jira/browse/SPARK-24304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479133#comment-16479133 ] Apache Spark commented on SPARK-24304: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/21353 > Scheduler changes for continuous processing shuffle support > --- > > Key: SPARK-24304 > URL: https://issues.apache.org/jira/browse/SPARK-24304 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Li Yuanjian >Priority: Major > > Both docs below include the required scheduler changes; adding a new JIRA to > collect our comments here. > https://docs.google.com/document/d/14cGJ75v9myznywtB35ytEqL9wHy9xfZRv06B6g2tUgI/edit#bookmark=id.tjcvevpyuv4x > https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit?disco=B4z058g -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24304) Scheduler changes for continuous processing shuffle support
[ https://issues.apache.org/jira/browse/SPARK-24304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24304: Assignee: (was: Apache Spark) > Scheduler changes for continuous processing shuffle support > --- > > Key: SPARK-24304 > URL: https://issues.apache.org/jira/browse/SPARK-24304 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Li Yuanjian >Priority: Major > > Both docs below include the required scheduler changes; adding a new JIRA to > collect our comments here. > https://docs.google.com/document/d/14cGJ75v9myznywtB35ytEqL9wHy9xfZRv06B6g2tUgI/edit#bookmark=id.tjcvevpyuv4x > https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit?disco=B4z058g -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24304) Scheduler changes for continuous processing shuffle support
[ https://issues.apache.org/jira/browse/SPARK-24304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24304: Assignee: Apache Spark > Scheduler changes for continuous processing shuffle support > --- > > Key: SPARK-24304 > URL: https://issues.apache.org/jira/browse/SPARK-24304 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Li Yuanjian >Assignee: Apache Spark >Priority: Major > > Both docs below include the required scheduler changes; adding a new JIRA to > collect our comments here. > https://docs.google.com/document/d/14cGJ75v9myznywtB35ytEqL9wHy9xfZRv06B6g2tUgI/edit#bookmark=id.tjcvevpyuv4x > https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit?disco=B4z058g -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24193) Sort by disk when number of limit is big in TakeOrderedAndProjectExec
[ https://issues.apache.org/jira/browse/SPARK-24193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24193: --- Assignee: jin xing > Sort by disk when number of limit is big in TakeOrderedAndProjectExec > - > > Key: SPARK-24193 > URL: https://issues.apache.org/jira/browse/SPARK-24193 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: jin xing >Assignee: jin xing >Priority: Major > Fix For: 2.4.0 > > > Physical plan of "_select colA from t order by colB limit M_" is > _TakeOrderedAndProject_; > Currently _TakeOrderedAndProject_ sorts data in memory, see > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L158 > > Shall we add a config so that, if the limit (M) is too big, we sort on > disk? That would resolve the memory issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24193) Sort by disk when number of limit is big in TakeOrderedAndProjectExec
[ https://issues.apache.org/jira/browse/SPARK-24193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24193. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21252 [https://github.com/apache/spark/pull/21252] > Sort by disk when number of limit is big in TakeOrderedAndProjectExec > - > > Key: SPARK-24193 > URL: https://issues.apache.org/jira/browse/SPARK-24193 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: jin xing >Assignee: jin xing >Priority: Major > Fix For: 2.4.0 > > > Physical plan of "_select colA from t order by colB limit M_" is > _TakeOrderedAndProject_; > Currently _TakeOrderedAndProject_ sorts data in memory, see > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L158 > > Shall we add a config so that, if the limit (M) is too big, we sort on > disk? That would resolve the memory issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
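To see why a large limit strains memory, here is an analogue only, not Spark's actual TakeOrderedAndProjectExec code: a top-K selection must keep K rows resident, so as K approaches the input size it degenerates into a full in-memory sort, which motivates spilling to disk.

```scala
import scala.collection.mutable

// Illustrative analogue, not Spark's actual TakeOrderedAndProjectExec code:
// a bounded max-heap keeps only the k smallest elements in memory, but as k
// (the LIMIT) grows toward the input size this degenerates into holding the
// whole dataset in memory.
def topK(xs: Iterator[Int], k: Int): Seq[Int] = {
  val heap = mutable.PriorityQueue.empty[Int]  // max-heap of the k smallest
  xs.foreach { x =>
    heap.enqueue(x)
    if (heap.size > k) heap.dequeue()          // evict the current maximum
  }
  heap.dequeueAll.reverse                      // return in ascending order
}
```

With k small the heap stays tiny regardless of input size; with k near the input size the heap holds nearly everything, which is the case the proposed config addresses by sorting on disk.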
[jira] [Updated] (SPARK-24002) Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes
[ https://issues.apache.org/jira/browse/SPARK-24002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24002: Fix Version/s: 2.3.1 > Task not serializable caused by > org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes > > > Key: SPARK-24002 > URL: https://issues.apache.org/jira/browse/SPARK-24002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > Having two queries one is a 1000-line SQL query and a 3000-line SQL query. > Need to run at least one hour with a heavy write workload to reproduce once. > {code} > Py4JJavaError: An error occurred while calling o153.sql. > : org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:223) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:189) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$59.apply(Dataset.scala:3021) > at > org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:89) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:127) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3020) > at org.apache.spark.sql.Dataset.(Dataset.scala:190) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:646) > at 
sun.reflect.GeneratedMethodAccessor153.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380) > at py4j.Gateway.invoke(Gateway.java:293) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:226) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.spark.SparkException: Exception thrown in Future.get: > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:190) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:267) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doConsume(BroadcastNestedLoopJoinExec.scala:530) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155) > at > org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:37) > at > org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:69) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155) > at > org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:144) > ... > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190) > ... 
23 more > Caused by: java.util.concurrent.ExecutionException: > org.apache.spark.SparkException: Task not serializable > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:206) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:179) > ... 276 more > Caused by: org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2380) > at >
[jira] [Commented] (SPARK-24036) Stateful operators in continuous processing
[ https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479122#comment-16479122 ] Apache Spark commented on SPARK-24036: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/21353 > Stateful operators in continuous processing > --- > > Key: SPARK-24036 > URL: https://issues.apache.org/jira/browse/SPARK-24036 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > The first iteration of continuous processing in Spark 2.3 does not work with > stateful operators. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24036) Stateful operators in continuous processing
[ https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24036: Assignee: (was: Apache Spark) > Stateful operators in continuous processing > --- > > Key: SPARK-24036 > URL: https://issues.apache.org/jira/browse/SPARK-24036 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > The first iteration of continuous processing in Spark 2.3 does not work with > stateful operators. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24036) Stateful operators in continuous processing
[ https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24036: Assignee: Apache Spark > Stateful operators in continuous processing > --- > > Key: SPARK-24036 > URL: https://issues.apache.org/jira/browse/SPARK-24036 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Apache Spark >Priority: Major > > The first iteration of continuous processing in Spark 2.3 does not work with > stateful operators. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24306) Sort a Dataset with a lambda (like RDD.sortBy)
[ https://issues.apache.org/jira/browse/SPARK-24306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Al M updated SPARK-24306: - Summary: Sort a Dataset with a lambda (like RDD.sortBy) (was: Sort a Dataset with a lambda (like RDD.sortBy() )) > Sort a Dataset with a lambda (like RDD.sortBy) > -- > > Key: SPARK-24306 > URL: https://issues.apache.org/jira/browse/SPARK-24306 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Al M >Priority: Minor > > Dataset has many useful functions that do not require hard-coded column names > (e.g. flatMapGroups, map). Currently the sort() function requires us to pass > the column name as a string. > It would be nice to have a function similar to the sortBy in RDD where I can > define the keys in a lambda e.g. > {code:java} > ds.sortBy(record => record.id){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
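The requested API can be sketched with a plain-Python model (hypothetical names; `Dataset` has no typed `sortBy` as of Spark 2.3.0): the lambda extracts the sort key, so no hard-coded column-name string is needed.

```python
from dataclasses import dataclass

# Hypothetical model of the requested ds.sortBy(record => record.id);
# not Spark code, just the semantics the ticket asks for.
@dataclass
class Record:
    id: int
    name: str

def sort_by(records, key_fn):
    # Sort by a key extracted with a lambda instead of a column-name string.
    return sorted(records, key=key_fn)

rows = [Record(3, "c"), Record(1, "a"), Record(2, "b")]
assert [r.id for r in sort_by(rows, lambda r: r.id)] == [1, 2, 3]
```

On a real typed Dataset, one existing workaround is dropping to the RDD API (e.g. `ds.rdd.sortBy(...)`), which already accepts a lambda, at the cost of leaving the Dataset abstraction.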
[jira] [Updated] (SPARK-24306) Sort a Dataset with a lambda (like RDD.sortBy() )
[ https://issues.apache.org/jira/browse/SPARK-24306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Al M updated SPARK-24306: - Summary: Sort a Dataset with a lambda (like RDD.sortBy() ) (was: Sort a Dataset with a lambda (like RDD.sortBy()) > Sort a Dataset with a lambda (like RDD.sortBy() ) > - > > Key: SPARK-24306 > URL: https://issues.apache.org/jira/browse/SPARK-24306 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Al M >Priority: Minor > > Dataset has many useful functions that do not require hard coded column names > (e.g. flatMapGroups, map). Currently the sort() function requires us to pass > the column name as a string. > It would be nice to have a function similar to the sortBy in RDD where i can > define the keys in a lambda e.g. > {code:java} > ds.sortBy(record => record.id){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23922) High-order function: arrays_overlap(x, y) → boolean
[ https://issues.apache.org/jira/browse/SPARK-23922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23922: --- Assignee: Marco Gaido > High-order function: arrays_overlap(x, y) → boolean > --- > > Key: SPARK-23922 > URL: https://issues.apache.org/jira/browse/SPARK-23922 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > Tests if arrays x and y have any non-null elements in common. Returns > null if there are no non-null elements in common but either array contains > null. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23922) High-order function: arrays_overlap(x, y) → boolean
[ https://issues.apache.org/jira/browse/SPARK-23922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23922. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21028 [https://github.com/apache/spark/pull/21028] > High-order function: arrays_overlap(x, y) → boolean > --- > > Key: SPARK-23922 > URL: https://issues.apache.org/jira/browse/SPARK-23922 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > Tests if arrays x and y have any non-null elements in common. Returns > null if there are no non-null elements in common but either array contains > null. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
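The three-valued semantics described in the ticket can be captured in a small reference model (a sketch of the stated SQL behavior, not Spark's implementation; SQL NULL is modeled as Python `None`):

```python
def arrays_overlap(x, y):
    # True  -> the arrays share at least one non-null element
    # None  -> no non-null overlap, but a null is present (SQL NULL result)
    # False -> no overlap and no nulls anywhere
    if x is None or y is None:
        return None
    xs = {e for e in x if e is not None}
    ys = {e for e in y if e is not None}
    if xs & ys:
        return True
    if None in x or None in y:
        return None
    return False

assert arrays_overlap([1, 2], [2, 3]) is True
assert arrays_overlap([1, None], [3]) is None   # null masks a possible match
assert arrays_overlap([1], [2]) is False
```

Note that a shared non-null element returns true even when nulls are also present, since the null can no longer change the answer.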
[jira] [Created] (SPARK-24306) Sort a Dataset with a lambda (like RDD.sortBy()
Al M created SPARK-24306: Summary: Sort a Dataset with a lambda (like RDD.sortBy() Key: SPARK-24306 URL: https://issues.apache.org/jira/browse/SPARK-24306 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Al M Dataset has many useful functions that do not require hard coded column names (e.g. flatMapGroups, map). Currently the sort() function requires us to pass the column name as a string. It would be nice to have a function similar to the sortBy in RDD where i can define the keys in a lambda e.g. {code:java} ds.sortBy(record => record.id){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24305) Avoid serialization of private fields in new collection expressions
[ https://issues.apache.org/jira/browse/SPARK-24305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478946#comment-16478946 ] Apache Spark commented on SPARK-24305: -- User 'mn-mikke' has created a pull request for this issue: https://github.com/apache/spark/pull/21352 > Avoid serialization of private fields in new collection expressions > --- > > Key: SPARK-24305 > URL: https://issues.apache.org/jira/browse/SPARK-24305 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Priority: Minor > > Make sure that private fields of expression case classes in > _collectionOperations.scala_ are not serialized. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24305) Avoid serialization of private fields in new collection expressions
[ https://issues.apache.org/jira/browse/SPARK-24305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24305: Assignee: (was: Apache Spark) > Avoid serialization of private fields in new collection expressions > --- > > Key: SPARK-24305 > URL: https://issues.apache.org/jira/browse/SPARK-24305 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Priority: Minor > > Make sure that private fields of expression case classes in > _collectionOperations.scala_ are not serialized. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24305) Avoid serialization of private fields in new collection expressions
[ https://issues.apache.org/jira/browse/SPARK-24305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24305: Assignee: Apache Spark > Avoid serialization of private fields in new collection expressions > --- > > Key: SPARK-24305 > URL: https://issues.apache.org/jira/browse/SPARK-24305 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Assignee: Apache Spark >Priority: Minor > > Make sure that private fields of expression case classes in > _collectionOperations.scala_ are not serialized. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24305) Avoid serialization of private fields in new collection expressions
Marek Novotny created SPARK-24305: - Summary: Avoid serialization of private fields in new collection expressions Key: SPARK-24305 URL: https://issues.apache.org/jira/browse/SPARK-24305 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Marek Novotny Make sure that private fields of expression case classes in _collectionOperations.scala_ are not serialized. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22371: --- Assignee: Artem Rudoy > dag-scheduler-event-loop thread stopped with error Attempted to access > garbage collected accumulator 5605982 > - > > Key: SPARK-22371 > URL: https://issues.apache.org/jira/browse/SPARK-22371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Mayank Agarwal >Assignee: Artem Rudoy >Priority: Major > Fix For: 2.4.0 > > Attachments: Helper.scala, ShuffleIssue.java, > driver-thread-dump-spark2.1.txt, sampledata > > > Our Spark jobs are getting stuck on DAGScheduler.runJob as the dag-scheduler > thread is stopped because of *Attempted to access garbage collected > accumulator 5605982*. > From our investigation it looks like the accumulator is cleaned by GC first, and the > same accumulator is then used for merging the results from the executor on the task > completion event. > Since the error java.lang.IllegalAccessError is a LinkageError, which is treated as > a fatal error, the dag-scheduler loop finishes with the exception below. 
> ---ERROR stack trace -- > Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 5605982 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > I am attaching the thread dump of driver as well -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
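The failure mode described above (an accumulator collected by the GC before a late task-completion event looks it up) can be reduced to a small weak-reference model. This is an assumption-laden sketch, far simpler than Spark's real `AccumulatorContext`; the determinism of the `del` relies on CPython's reference counting.

```python
import weakref

class Accumulator:
    def __init__(self, acc_id):
        self.acc_id = acc_id

# The registry keeps only weak references, so it does not prevent collection.
originals = {}

def register(acc):
    originals[acc.acc_id] = weakref.ref(acc)

def get(acc_id):
    # A dead weak reference is exactly the condition behind the error message.
    ref = originals.get(acc_id)
    acc = ref() if ref is not None else None
    if acc is None:
        raise RuntimeError(
            f"Attempted to access garbage collected accumulator {acc_id}")
    return acc

acc = Accumulator(5605982)
register(acc)
assert get(5605982) is acc
del acc  # drop the last strong reference; CPython collects it immediately
try:
    get(5605982)      # a late task-completion event would land here
    raised = False
except RuntimeError:
    raised = True
assert raised
```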
[jira] [Resolved] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22371. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21114 [https://github.com/apache/spark/pull/21114] > dag-scheduler-event-loop thread stopped with error Attempted to access > garbage collected accumulator 5605982 > - > > Key: SPARK-22371 > URL: https://issues.apache.org/jira/browse/SPARK-22371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Mayank Agarwal >Priority: Major > Fix For: 2.4.0 > > Attachments: Helper.scala, ShuffleIssue.java, > driver-thread-dump-spark2.1.txt, sampledata > > > Our Spark Jobs are getting stuck on DagScheduler.runJob as dagscheduler > thread is stopped because of *Attempted to access garbage collected > accumulator 5605982*. > from our investigation it look like accumulator is cleaned by GC first and > same accumulator is used for merging the results from executor on task > completion event. > As the error java.lang.IllegalAccessError is LinkageError which is treated as > FatalError so dag-scheduler loop is finished with below exception. 
> ---ERROR stack trace -- > Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 5605982 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > I am attaching the thread dump of driver as well -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored
[ https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478811#comment-16478811 ] Sergey Rubtsov commented on SPARK-19228: Java 8 contains the new java.time module; it could also fix an old bug with parsing a string to an SQL timestamp value with microsecond accuracy: https://issues.apache.org/jira/browse/SPARK-10681 > inferSchema function processed csv date column as string and "dateFormat" > DataSource option is ignored > -- > > Key: SPARK-19228 > URL: https://issues.apache.org/jira/browse/SPARK-19228 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 2.1.0 >Reporter: Sergey Rubtsov >Priority: Major > Labels: easyfix > Original Estimate: 6h > Remaining Estimate: 6h > > I need to process user.csv like this: > {code} > id,project,started,ended > sergey.rubtsov,project0,12/12/2012,10/10/2015 > {code} > When I add date format options: > {code} > Dataset<Row> users = spark.read().format("csv").option("mode", > "PERMISSIVE").option("header", "true") > .option("inferSchema", > "true").option("dateFormat", > "dd/MM/yyyy").load("src/main/resources/user.csv"); > users.printSchema(); > {code} > the expected schema should be > {code} > root > |-- id: string (nullable = true) > |-- project: string (nullable = true) > |-- started: date (nullable = true) > |-- ended: date (nullable = true) > {code} > but the actual result is: > {code} > root > |-- id: string (nullable = true) > |-- project: string (nullable = true) > |-- started: string (nullable = true) > |-- ended: string (nullable = true) > {code} > This means that the date is processed as a string and the "dateFormat" option is ignored. 
> If I add the option > {code} > .option("timestampFormat", "dd/MM/yyyy") > {code} > the result is: > {code} > root > |-- id: string (nullable = true) > |-- project: string (nullable = true) > |-- started: timestamp (nullable = true) > |-- ended: timestamp (nullable = true) > {code} > I think the issue is somewhere in object CSVInferSchema, function > inferField, lines 80-97: a > "tryParseDate" method needs to be added before/after "tryParseTimestamp", or the > date/timestamp processing logic needs to be changed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
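The proposed inference order can be sketched in miniature (hypothetical helper names, not the actual `CSVInferSchema` code): attempt a date parse with the user's format before falling back to string/timestamp inference.

```python
from datetime import datetime

# Assumed format corresponds to the ticket's dd/MM/yyyy example.
def try_parse_date(field, date_format="%d/%m/%Y"):
    # Return a date if the field matches the user's dateFormat, else None.
    try:
        return datetime.strptime(field, date_format).date()
    except ValueError:
        return None

def infer_field(field):
    # Sketch of the fix: a date attempt runs before the string fallback.
    return "date" if try_parse_date(field) is not None else "string"

assert infer_field("12/12/2012") == "date"
assert infer_field("sergey.rubtsov") == "string"
```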
[jira] [Created] (SPARK-24304) Scheduler changes for continuous processing shuffle support
Li Yuanjian created SPARK-24304: --- Summary: Scheduler changes for continuous processing shuffle support Key: SPARK-24304 URL: https://issues.apache.org/jira/browse/SPARK-24304 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Li Yuanjian Both docs below include the change requirements for the scheduler; adding a new JIRA to collect our comments here. https://docs.google.com/document/d/14cGJ75v9myznywtB35ytEqL9wHy9xfZRv06B6g2tUgI/edit#bookmark=id.tjcvevpyuv4x https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit?disco=B4z058g -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24298) PCAModel Memory in Pipeline
[ https://issues.apache.org/jira/browse/SPARK-24298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478802#comment-16478802 ] Marco Gaido commented on SPARK-24298: - Could you please provide a small program or a simple list of steps to reproduce the problem you are seeing? Thanks. > PCAModel Memory in Pipeline > --- > > Key: SPARK-24298 > URL: https://issues.apache.org/jira/browse/SPARK-24298 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Hayri Volkan Agun >Priority: Major > > Java memory usage grows when using the PCAModel transformer in a classification > pipeline. Although it is a dimensionality reduction method, memory increases > continuously during classification. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24288) Enable preventing predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478671#comment-16478671 ] Wenchen Fan commented on SPARK-24288: - You can take a look at `ResolveReferences#dedupRight`, which breaks AnalysisBarrier. My feeling is that wherever we break AnalysisBarrier, we should also break the optimize barrier, otherwise we can't resolve the plan. > Enable preventing predicate pushdown > > > Key: SPARK-24288 > URL: https://issues.apache.org/jira/browse/SPARK-24288 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Tomasz Gawęda >Priority: Major > Attachments: SPARK-24288.simple.patch > > > Issue discussed on the mailing list: > [http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html] > While working with the JDBC datasource I saw that many "or" clauses with > non-equality operators cause huge performance degradation of the SQL query > sent to the database (DB2). For example: > val df = spark.read.format("jdbc").(other options to parallelize > load).load() > df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > > 100)").show() // in the real application the predicates were pushed > many lines below, with many ANDs and ORs > If I use cache() before the where, there is no predicate pushdown of this > "where" clause. However, in a production system caching many sources is a > waste of memory (especially if the pipeline is long and I must cache many > times). There are also a few more workarounds, but it would be great if Spark > supported letting the user prevent predicate pushdown. > > For example: df.withAnalysisBarrier().where(...) ? > > Note that this should not be a global configuration option. If I read 2 > DataFrames, df1 and df2, I would like to specify that df1 should not have > some predicates pushed down, but some may be, but df2 should have all > predicates pushed down, even if the target query joins df1 and df2. 
As far as I > understand the Spark optimizer, if we use functions like `withAnalysisBarrier` > and put AnalysisBarrier explicitly in the logical plan, then predicates won't be > pushed down on that particular DataFrame, while predicate pushdown will still be possible on > the second one. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
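The barrier idea can be illustrated with a toy logical plan (hypothetical node names; Spark's real `AnalysisBarrier` is an internal analyzer construct): the pushdown rule moves a Filter below a Project unless a Barrier node blocks it.

```python
# Plans are tuples: ("scan", name), ("project", cols, child),
# ("filter", cond, child), ("barrier", child).
def push_down(plan):
    kind = plan[0]
    if kind == "filter":
        _, cond, child = plan
        if child[0] == "project":
            # Normal case: the filter commutes past the projection.
            _, cols, grandchild = child
            return ("project", cols, push_down(("filter", cond, grandchild)))
        if child[0] == "barrier":
            # Barrier case: the filter stays put; only the subtree is optimized.
            return ("filter", cond, ("barrier", push_down(child[1])))
        return ("filter", cond, push_down(child))
    if kind == "project":
        return ("project", plan[1], push_down(plan[2]))
    if kind == "barrier":
        return ("barrier", push_down(plan[1]))
    return plan  # scan

open_plan = ("filter", "x > 100", ("project", ("x",), ("scan", "t")))
blocked = ("filter", "x > 100", ("barrier", ("project", ("x",), ("scan", "t"))))
assert push_down(open_plan) == ("project", ("x",), ("filter", "x > 100", ("scan", "t")))
assert push_down(blocked) == blocked  # the barrier keeps the filter above it
```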
[jira] [Commented] (SPARK-23928) High-order function: shuffle(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478619#comment-16478619 ] H Lu commented on SPARK-23928: -- Update: running unit tests and having a spark contributor in the team to review the code before I submit a pull request. Thanks. > High-order function: shuffle(x) → array > --- > > Key: SPARK-23928 > URL: https://issues.apache.org/jira/browse/SPARK-23928 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Generate a random permutation of the given array x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
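The semantics from the linked Presto doc ("generate a random permutation of the given array x") can be modeled as follows (a reference sketch, not necessarily Spark's eventual implementation; NULL in, NULL out is an assumption carried over from similar array functions):

```python
import random

def shuffle_array(x, rng=None):
    # Return a random permutation of x; a NULL (None) array maps to NULL.
    if x is None:
        return None
    rng = rng or random.Random()
    out = list(x)
    rng.shuffle(out)
    return out

permuted = shuffle_array([1, 2, 3, 4], rng=random.Random(0))
assert sorted(permuted) == [1, 2, 3, 4]  # same multiset, possibly new order
assert shuffle_array(None) is None
```

Passing an explicit `rng` keeps tests deterministic while the production path would draw fresh randomness per row.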
[jira] [Updated] (SPARK-24065) Issue with the property IgnoreLeadingWhiteSpace
[ https://issues.apache.org/jira/browse/SPARK-24065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varsha Chandrashekar updated SPARK-24065: - Component/s: Spark Shell > Issue with the property IgnoreLeadingWhiteSpace > --- > > Key: SPARK-24065 > URL: https://issues.apache.org/jira/browse/SPARK-24065 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell >Affects Versions: 2.2.0 >Reporter: Varsha Chandrashekar >Priority: Major > Attachments: lds1.txt, spark-shell-result.PNG > > > The "ignoreLeadingWhiteSpace" property is not working properly for a corner case. > Consider the data below: > ||"Col1"||"Col2"||"Col3"|| > | "A" | "Mark" | "US" | > | "B" | "Luke" | "UK" | > Each cell contains leading and trailing white spaces. When I upload > the dataset passing "ignoreTrailingWhiteSpace" as true, the trailing > spaces are trimmed, which is right. But when I pass > "ignoreLeadingWhiteSpace" as true it does not trim the leading spaces. > The scenario was tested/executed in spark-shell. Refer to the result below: > case 1: scala> var > df=spark.read.format("com.databricks.spark.csv").option("delimiter",",").option("qualifier","\"").option("escape","\\").option("header","true").option("inferSchema","true").option("ignoreLeadingWhiteSpace",false).option("ignoreTrailingWhiteSpace",false).load("C:\\Users\\vachandrashekar\\Desktop\\lds1.txt") > df: org.apache.spark.sql.DataFrame = [col1: string, Col2: string ... 
1 more > field] > scala> df.show() > +----+------+----+ > |Col1|Col2|Col3| > +----+------+----+ > | "A" | "Mark" | "US" | > | "B" | "Luke" | "UK" | > +----+------+----+ > case 2: scala> var > df=spark.read.format("com.databricks.spark.csv").option("delimiter",",").option("qualifier","\"").option("escape","\\").option("header","true").option("inferSchema","true").option("ignoreLeadingWhiteSpace",true).option("ignoreTrailingWhiteSpace",false).load("C:\\Users\\vachandrashekar\\Desktop\\lds1.txt") > df: org.apache.spark.sql.DataFrame = [col1: string, Col2: string ... 1 more > field] > scala> df.show() > +----+----+----+ > |Col1|Col2|Col3| > +----+----+----+ > | A|Mark|US| > | B| Luke| UK| > +----+----+----+ > case 3: scala> var > df=spark.read.format("com.databricks.spark.csv").option("delimiter",",").option("qualifier","\"").option("escape","\\").option("header","true").option("inferSchema","true").option("ignoreLeadingWhiteSpace",false).option("ignoreTrailingWhiteSpace",true).load("C:\\Users\\vachandrashekar\\Desktop\\lds1.txt") > df: org.apache.spark.sql.DataFrame = [col1: string, Col2: string ... 1 more > field] > scala> df.show() > +----+-------+-----+ > |col1|Col2|Col3| > +----+-------+-----+ > | "A"| "Mark"| "US"| > | "B"| "Luke"| "UK"| > +----+-------+-----+ > > Analysis: > Case 1 : Works fine; with "ignoreLeadingWhiteSpace" and > "ignoreTrailingWhiteSpace" as false, the data is previewed as in the file. > > Case 2 : Not working!! With "ignoreLeadingWhiteSpace" as true and > "ignoreTrailingWhiteSpace" as false, trailing white spaces end up trimmed > and leading white spaces are retained. > It does trim leading white space, but only for two columns in the first row, > excluding the first column in that row. > > Case 3 : Works fine; with "ignoreLeadingWhiteSpace" as false and > "ignoreTrailingWhiteSpace" as true, only trailing white spaces have been > trimmed and leading white spaces are retained. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
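The expected behavior of the two options can be stated as a minimal model (hypothetical helper; in Spark the actual trimming is delegated to the underlying CSV parser): each side of a cell is trimmed independently, according to the corresponding flag.

```python
def clean_cell(cell, ignore_leading, ignore_trailing):
    # Trim each side of the cell independently, per the corresponding option.
    if ignore_leading:
        cell = cell.lstrip(" ")
    if ignore_trailing:
        cell = cell.rstrip(" ")
    return cell

# The corner case from the report: leading trim alone should leave
# trailing spaces intact, and vice versa.
assert clean_cell('  "A"  ', ignore_leading=True, ignore_trailing=False) == '"A"  '
assert clean_cell('  "A"  ', ignore_leading=False, ignore_trailing=True) == '  "A"'
```

Under this model, case 2 above should have produced cells with trailing spaces preserved and leading spaces removed, which is what the bug report says does not happen.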
[jira] [Commented] (SPARK-24002) Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes
[ https://issues.apache.org/jira/browse/SPARK-24002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478596#comment-16478596 ] Apache Spark commented on SPARK-24002: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/21351 > Task not serializable caused by > org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes > > > Key: SPARK-24002 > URL: https://issues.apache.org/jira/browse/SPARK-24002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 2.4.0 > > > Having two queries one is a 1000-line SQL query and a 3000-line SQL query. > Need to run at least one hour with a heavy write workload to reproduce once. > {code} > Py4JJavaError: An error occurred while calling o153.sql. > : org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:223) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:189) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$59.apply(Dataset.scala:3021) > at > org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:89) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:127) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3020) > at org.apache.spark.sql.Dataset.(Dataset.scala:190) > at 
org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:646) > at sun.reflect.GeneratedMethodAccessor153.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380) > at py4j.Gateway.invoke(Gateway.java:293) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:226) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.spark.SparkException: Exception thrown in Future.get: > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:190) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:267) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doConsume(BroadcastNestedLoopJoinExec.scala:530) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155) > at > org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:37) > at > org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:69) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155) > at > org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:144) > ... > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190) > ... 
23 more > Caused by: java.util.concurrent.ExecutionException: > org.apache.spark.SparkException: Task not serializable > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:206) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:179) > ... 276 more > Caused by: org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330) > at