[jira] [Commented] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode

2018-05-17 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480199#comment-16480199
 ] 

Dongjoon Hyun commented on SPARK-21945:
---

Thank you for verifying and including this in 2.3.1 RC2, [~vanzin].

> pyspark --py-files doesn't work in yarn client mode
> ---
>
> Key: SPARK-21945
> URL: https://issues.apache.org/jira/browse/SPARK-21945
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> I tried running pyspark with --py-files pythonfiles.zip, but it doesn't 
> properly add the zip file to the PYTHONPATH.
> I can work around this by exporting PYTHONPATH.
> Looking at SparkSubmitCommandBuilder.buildPySparkShellCommand, I don't see 
> this supported at all. If that is the case, perhaps this issue should be 
> reclassified as an improvement.
> Note that running a Python script via spark-submit works in both client and 
> cluster mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16317) Add file filtering interface for FileFormat

2018-05-17 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480127#comment-16480127
 ] 

Xiao Li commented on SPARK-16317:
-

We will not improve FileFormat, since we are migrating the implementation to 
Data Source V2.

> Add file filtering interface for FileFormat
> ---
>
> Key: SPARK-16317
> URL: https://issues.apache.org/jira/browse/SPARK-16317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Priority: Minor
>
> {{FileFormat}} data sources like Parquet and Avro (provided by spark-avro) 
> have customized file filtering logic. For example, Parquet needs to filter 
> out summary files, while Avro provides a Hadoop configuration option to 
> filter out all files whose names don't end with ".avro".
> It would be nice to have a general file filtering interface in {{FileFormat}} 
> to handle similar requirements.
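As a rough illustration of what such an interface could look like (the names 
and signatures below are hypothetical, not an actual Spark API), each format 
would contribute a filename predicate on top of a common default:

```python
from typing import Callable, List

# A filter is just a predicate over a file name.
FileFilter = Callable[[str], bool]

def default_filter(name: str) -> bool:
    # Skip hidden files and job markers like _SUCCESS, as file-based
    # sources commonly do.
    return not name.startswith((".", "_"))

def avro_filter(name: str) -> bool:
    # Avro-style: additionally keep only files ending with ".avro".
    return default_filter(name) and name.endswith(".avro")

def list_data_files(names: List[str], keep: FileFilter) -> List[str]:
    return [n for n in names if keep(n)]

files = ["part-0.avro", "_SUCCESS", ".hidden", "summary.txt"]
print(list_data_files(files, avro_filter))     # ['part-0.avro']
print(list_data_files(files, default_filter))  # ['part-0.avro', 'summary.txt']
```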






[jira] [Resolved] (SPARK-16317) Add file filtering interface for FileFormat

2018-05-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-16317.
-
Resolution: Won't Fix

> Add file filtering interface for FileFormat
> ---
>
> Key: SPARK-16317
> URL: https://issues.apache.org/jira/browse/SPARK-16317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Priority: Minor
>
> {{FileFormat}} data sources like Parquet and Avro (provided by spark-avro) 
> have customized file filtering logic. For example, Parquet needs to filter 
> out summary files, while Avro provides a Hadoop configuration option to 
> filter out all files whose names don't end with ".avro".
> It would be nice to have a general file filtering interface in {{FileFormat}} 
> to handle similar requirements.






[jira] [Updated] (SPARK-20758) Add Constant propagation optimization

2018-05-17 Thread Jinhua Fu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinhua Fu updated SPARK-20758:
--
Issue Type: New JIRA Project  (was: Improvement)

> Add Constant propagation optimization
> -
>
> Key: SPARK-20758
> URL: https://issues.apache.org/jira/browse/SPARK-20758
> Project: Spark
>  Issue Type: New JIRA Project
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.3.0
>
>
> Constant propagation involves substituting attributes that can be statically 
> evaluated in expressions. It's a pretty common optimization in the compiler 
> world.
> E.g.
> {noformat}
> SELECT * FROM table WHERE i = 5 AND j = i + 3
> {noformat}
> can be re-written as:
> {noformat}
> SELECT * FROM table WHERE i = 5 AND j = 8
> {noformat}
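A toy sketch of the substitution step (not Catalyst's actual implementation), 
assuming each predicate is either a column bound to an integer constant or a 
column bound to another column plus an offset, as in {{j = i + 3}}:

```python
# Each predicate is (column, expr), where expr is an int constant or a
# (other_column, offset) pair representing other_column + offset.
def propagate(predicates):
    # Collect columns already bound to constants.
    consts = {col: v for col, v in predicates if isinstance(v, int)}
    out = []
    for col, expr in predicates:
        if isinstance(expr, tuple):
            other, offset = expr
            if other in consts:
                # Substitute the known constant and fold statically.
                expr = consts[other] + offset
        out.append((col, expr))
    return out

# WHERE i = 5 AND j = i + 3  ->  WHERE i = 5 AND j = 8
print(propagate([("i", 5), ("j", ("i", 3))]))  # [('i', 5), ('j', 8)]
```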






[jira] [Updated] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982

2018-05-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22371:

Fix Version/s: 2.3.1

> dag-scheduler-event-loop thread stopped with error  Attempted to access 
> garbage collected accumulator 5605982
> -
>
> Key: SPARK-22371
> URL: https://issues.apache.org/jira/browse/SPARK-22371
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Mayank Agarwal
>Assignee: Artem Rudoy
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
> Attachments: Helper.scala, ShuffleIssue.java, 
> driver-thread-dump-spark2.1.txt, sampledata
>
>
> Our Spark jobs are getting stuck in DAGScheduler.runJob because the 
> dag-scheduler thread is stopped by *Attempted to access garbage collected 
> accumulator 5605982*.
> From our investigation, it looks like the accumulator is cleaned by GC first, 
> and the same accumulator is then used for merging the results from the 
> executor on a task-completion event.
> Since java.lang.IllegalAccessError is a LinkageError, which is treated as a 
> fatal error, the dag-scheduler loop terminates with the exception below.
> ---ERROR stack trace --
> Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 5605982
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> I am attaching the thread dump of the driver as well.
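The failure mode can be reproduced in miniature with a registry that, like 
AccumulatorContext, holds only weak references, so an entry can be collected 
while its id is still floating around in late events (a Python sketch with 
illustrative names):

```python
import gc
import weakref

class Acc:
    def __init__(self):
        self.value = 0

# The registry keeps weak references only, so it never keeps an
# accumulator alive by itself.
registry = {}

def register(acc_id, acc):
    registry[acc_id] = weakref.ref(acc)

def get(acc_id):
    acc = registry[acc_id]()
    if acc is None:
        # Mirrors "Attempted to access garbage collected accumulator <id>".
        raise RuntimeError(
            f"Attempted to access garbage collected accumulator {acc_id}")
    return acc

acc = Acc()
register(5605982, acc)
del acc       # drop the last strong reference
gc.collect()  # the registry entry is now dead

try:
    get(5605982)  # a late task-completion event would hit this path
    reached = False
except RuntimeError:
    reached = True
print(reached)  # True
```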






[jira] [Resolved] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering

2018-05-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-22884.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21358
[https://github.com/apache/spark/pull/21358]

> ML test for StructuredStreaming: spark.ml.clustering
> 
>
> Key: SPARK-22884
> URL: https://issues.apache.org/jira/browse/SPARK-22884
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Sandor Murakozi
>Priority: Major
> Fix For: 3.0.0
>
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> spark.ml sub-module.
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Assigned] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering

2018-05-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-22884:
-

Assignee: Sandor Murakozi

> ML test for StructuredStreaming: spark.ml.clustering
> 
>
> Key: SPARK-22884
> URL: https://issues.apache.org/jira/browse/SPARK-22884
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Sandor Murakozi
>Priority: Major
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> spark.ml sub-module.
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Commented] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering

2018-05-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479814#comment-16479814
 ] 

Apache Spark commented on SPARK-22884:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/21358

> ML test for StructuredStreaming: spark.ml.clustering
> 
>
> Key: SPARK-22884
> URL: https://issues.apache.org/jira/browse/SPARK-22884
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> spark.ml sub-module.
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Updated] (SPARK-24309) AsyncEventQueue should handle an interrupt from a Listener

2018-05-17 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-24309:
-
Target Version/s: 2.3.1

> AsyncEventQueue should handle an interrupt from a Listener
> --
>
> Key: SPARK-24309
> URL: https://issues.apache.org/jira/browse/SPARK-24309
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Blocker
>
> AsyncEventQueue does not properly handle an interrupt from a Listener -- the 
> Spark app won't even stop!
> I observed this on an actual workload, as the EventLoggingListener can be 
> interrupted by the underlying HDFS calls:
> {noformat}
> 18/05/16 17:46:36 WARN hdfs.DFSClient: Error transferring data from 
> DatanodeInfoWithStorage[10.17.206.36:20002,DS-3adac910-5d0a-418b-b0f7-6332b35bf6a1,DISK]
>  to 
> DatanodeInfoWithStorage[10.17.206.42:20002,DS-2e7ed0aa-0e68-441e-b5b2-96ad4a9ce7a5,DISK]:
>  10 millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/10.17.206.35:33950 
> remote=/10.17.206.36:20002]
> 18/05/16 17:46:36 WARN hdfs.DFSClient: DataStreamer Exception
> java.net.SocketTimeoutException: 10 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/10.17.206.35:33950 remote=/10.17.206.36:20002]
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)
> at 
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2305)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$StreamerStreams.sendTransferBlock(DFSOutputStream.java:516)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1450)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1408)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1559)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1254)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:739)
> 18/05/16 17:46:36 ERROR scheduler.AsyncEventQueue: Listener 
> EventLoggingListener threw an exception
> [... a few more of these ...]
> 18/05/16 17:46:36 INFO scheduler.AsyncEventQueue: Stopping listener queue 
> eventLog.
> java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
> at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:94)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:83)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:79)
> at 
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:78)
> {noformat}
> When this happens, the AsyncEventQueue continues to pile up events in its 
> queue even though it's no longer processing them. The call to stop then 
> blocks on {{queue.put(POISON_PILL)}} forever, so the SparkContext won't 
> stop.
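The hang can be sketched with a bounded queue whose consumer has died (the 
capacity and event names are illustrative; the real queue and POISON_PILL live 
in AsyncEventQueue):

```python
import queue

POISON_PILL = object()

# Once the dispatcher thread dies from the interrupt, nothing drains
# the bounded event queue.
events = queue.Queue(maxsize=2)

# Events keep arriving after the dispatcher is gone...
events.put("task_end_1")
events.put("task_end_2")  # queue is now full, and no one is consuming

# stop() doing a blocking events.put(POISON_PILL) would wait forever
# here; a timeout makes the deadlock observable instead.
try:
    events.put(POISON_PILL, timeout=0.1)
    hung = False
except queue.Full:
    hung = True
print(hung)  # True
```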






[jira] [Updated] (SPARK-24309) AsyncEventQueue should handle an interrupt from a Listener

2018-05-17 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-24309:
-
Priority: Blocker  (was: Major)

> AsyncEventQueue should handle an interrupt from a Listener
> --
>
> Key: SPARK-24309
> URL: https://issues.apache.org/jira/browse/SPARK-24309
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Blocker
>
> AsyncEventQueue does not properly handle an interrupt from a Listener -- the 
> Spark app won't even stop!
> I observed this on an actual workload, as the EventLoggingListener can be 
> interrupted by the underlying HDFS calls:
> {noformat}
> 18/05/16 17:46:36 WARN hdfs.DFSClient: Error transferring data from 
> DatanodeInfoWithStorage[10.17.206.36:20002,DS-3adac910-5d0a-418b-b0f7-6332b35bf6a1,DISK]
>  to 
> DatanodeInfoWithStorage[10.17.206.42:20002,DS-2e7ed0aa-0e68-441e-b5b2-96ad4a9ce7a5,DISK]:
>  10 millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/10.17.206.35:33950 
> remote=/10.17.206.36:20002]
> 18/05/16 17:46:36 WARN hdfs.DFSClient: DataStreamer Exception
> java.net.SocketTimeoutException: 10 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/10.17.206.35:33950 remote=/10.17.206.36:20002]
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)
> at 
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2305)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$StreamerStreams.sendTransferBlock(DFSOutputStream.java:516)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1450)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1408)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1559)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1254)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:739)
> 18/05/16 17:46:36 ERROR scheduler.AsyncEventQueue: Listener 
> EventLoggingListener threw an exception
> [... a few more of these ...]
> 18/05/16 17:46:36 INFO scheduler.AsyncEventQueue: Stopping listener queue 
> eventLog.
> java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
> at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:94)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:83)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:79)
> at 
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:78)
> {noformat}
> When this happens, the AsyncEventQueue continues to pile up events in its 
> queue even though it's no longer processing them. The call to stop then 
> blocks on {{queue.put(POISON_PILL)}} forever, so the SparkContext won't 
> stop.






[jira] [Updated] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode

2018-05-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-21945:
---
Fix Version/s: 2.3.1

> pyspark --py-files doesn't work in yarn client mode
> ---
>
> Key: SPARK-21945
> URL: https://issues.apache.org/jira/browse/SPARK-21945
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> I tried running pyspark with --py-files pythonfiles.zip, but it doesn't 
> properly add the zip file to the PYTHONPATH.
> I can work around this by exporting PYTHONPATH.
> Looking at SparkSubmitCommandBuilder.buildPySparkShellCommand, I don't see 
> this supported at all. If that is the case, perhaps this issue should be 
> reclassified as an improvement.
> Note that running a Python script via spark-submit works in both client and 
> cluster mode.






[jira] [Commented] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode

2018-05-17 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479765#comment-16479765
 ] 

Marcelo Vanzin commented on SPARK-21945:


This works for cluster mode and the pyspark shell, but not for spark-submit in 
client mode. Still, this is an improvement, so I'll cherry-pick it.

It would be nice to open a separate bug to close the remaining gap.

> pyspark --py-files doesn't work in yarn client mode
> ---
>
> Key: SPARK-21945
> URL: https://issues.apache.org/jira/browse/SPARK-21945
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> I tried running pyspark with --py-files pythonfiles.zip, but it doesn't 
> properly add the zip file to the PYTHONPATH.
> I can work around this by exporting PYTHONPATH.
> Looking at SparkSubmitCommandBuilder.buildPySparkShellCommand, I don't see 
> this supported at all. If that is the case, perhaps this issue should be 
> reclassified as an improvement.
> Note that running a Python script via spark-submit works in both client and 
> cluster mode.






[jira] [Assigned] (SPARK-24311) Refactor HDFSBackedStateStoreProvider to remove duplicated logic between operations on delta file and snapshot file

2018-05-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24311:


Assignee: Apache Spark

> Refactor HDFSBackedStateStoreProvider to remove duplicated logic between 
> operations on delta file and snapshot file
> ---
>
> Key: SPARK-24311
> URL: https://issues.apache.org/jira/browse/SPARK-24311
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> The structure of the delta file and the snapshot file is the same, but the 
> operations are defined as individual methods, which duplicates logic between 
> the delta-file path and the snapshot-file path.
> We can refactor to remove the duplicated logic, improving readability and 
> guaranteeing that the delta file and the snapshot file keep the same 
> structure.
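One shape the refactor could take (a hypothetical Python sketch, not the 
actual Scala code) is a single read/write routine parameterized by the file 
kind, so both paths share one implementation:

```python
# Map each file kind to the suffix it must carry; everything else is shared.
SUFFIX = {"delta": ".delta", "snapshot": ".snapshot"}

def write_store_file(path, kind, entries, fs):
    # One code path for both kinds; only the suffix check differs.
    assert path.endswith(SUFFIX[kind]), f"bad {kind} file name: {path}"
    fs[path] = list(entries)

def read_store_file(path, kind, fs):
    assert path.endswith(SUFFIX[kind]), f"bad {kind} file name: {path}"
    return fs[path]

fs = {}  # an in-memory stand-in for HDFS
write_store_file("1.delta", "delta", [("k", "v1")], fs)
write_store_file("1.snapshot", "snapshot", [("k", "v2")], fs)
print(read_store_file("1.delta", "delta", fs))        # [('k', 'v1')]
print(read_store_file("1.snapshot", "snapshot", fs))  # [('k', 'v2')]
```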






[jira] [Commented] (SPARK-24311) Refactor HDFSBackedStateStoreProvider to remove duplicated logic between operations on delta file and snapshot file

2018-05-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479728#comment-16479728
 ] 

Apache Spark commented on SPARK-24311:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/21357

> Refactor HDFSBackedStateStoreProvider to remove duplicated logic between 
> operations on delta file and snapshot file
> ---
>
> Key: SPARK-24311
> URL: https://issues.apache.org/jira/browse/SPARK-24311
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> The structure of the delta file and the snapshot file is the same, but the 
> operations are defined as individual methods, which duplicates logic between 
> the delta-file path and the snapshot-file path.
> We can refactor to remove the duplicated logic, improving readability and 
> guaranteeing that the delta file and the snapshot file keep the same 
> structure.






[jira] [Updated] (SPARK-24311) Refactor HDFSBackedStateStoreProvider to remove duplicated logic between operations on delta file and snapshot file

2018-05-17 Thread Jungtaek Lim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-24311:
-
Description: 
The structure of delta file and snapshot file is same, but the operations are 
defined as individual methods which incurs duplicated logic between delta file 
and snapshot file.

We can refactor to remove duplicated logic to ensure readability, as well as 
guaranteeing that the structure of delta file and snapshot file keeps same.

  was:
The structure of delta file and snapshot file is same, but the operations are 
defined as individual methods which incurs duplicated logic between logic for 
delta file and logic for snapshot file.

We can refactor to remove duplicated logic to ensure readability, as well as 
guaranteeing that the structure of delta file and snapshot file keeps same.


> Refactor HDFSBackedStateStoreProvider to remove duplicated logic between 
> operations on delta file and snapshot file
> ---
>
> Key: SPARK-24311
> URL: https://issues.apache.org/jira/browse/SPARK-24311
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> The structure of the delta file and the snapshot file is the same, but the 
> operations are defined as individual methods, which duplicates logic between 
> the delta file and the snapshot file.
> We can refactor to remove the duplicated logic, improving readability and 
> guaranteeing that the delta file and the snapshot file keep the same 
> structure.






[jira] [Assigned] (SPARK-24311) Refactor HDFSBackedStateStoreProvider to remove duplicated logic between operations on delta file and snapshot file

2018-05-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24311:


Assignee: (was: Apache Spark)

> Refactor HDFSBackedStateStoreProvider to remove duplicated logic between 
> operations on delta file and snapshot file
> ---
>
> Key: SPARK-24311
> URL: https://issues.apache.org/jira/browse/SPARK-24311
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> The structure of the delta file and the snapshot file is the same, but the 
> operations are defined as individual methods, which duplicates logic between 
> the delta-file path and the snapshot-file path.
> We can refactor to remove the duplicated logic, improving readability and 
> guaranteeing that the delta file and the snapshot file keep the same 
> structure.






[jira] [Created] (SPARK-24311) Refactor HDFSBackedStateStoreProvide to remove duplicated logic between operations on delta file and operations on snapshot file

2018-05-17 Thread Jungtaek Lim (JIRA)
Jungtaek Lim created SPARK-24311:


 Summary: Refactor HDFSBackedStateStoreProvide to remove duplicated 
logic between operations on delta file and operations on snapshot file
 Key: SPARK-24311
 URL: https://issues.apache.org/jira/browse/SPARK-24311
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Jungtaek Lim


The structure of the delta file and the snapshot file is the same, but the 
operations are defined as individual methods, which duplicates logic between 
the delta-file path and the snapshot-file path.

We can refactor to remove the duplicated logic, improving readability and 
guaranteeing that the delta file and the snapshot file keep the same 
structure.






[jira] [Updated] (SPARK-24311) Refactor HDFSBackedStateStoreProvider to remove duplicated logic between operations on delta file and snapshot file

2018-05-17 Thread Jungtaek Lim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-24311:
-
Summary: Refactor HDFSBackedStateStoreProvider to remove duplicated logic 
between operations on delta file and snapshot file  (was: Refactor 
HDFSBackedStateStoreProvide to remove duplicated logic between operations on 
delta file and snapshot file)

> Refactor HDFSBackedStateStoreProvider to remove duplicated logic between 
> operations on delta file and snapshot file
> ---
>
> Key: SPARK-24311
> URL: https://issues.apache.org/jira/browse/SPARK-24311
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> The structure of the delta file and the snapshot file is the same, but the 
> operations are defined as individual methods, which duplicates logic between 
> the delta-file path and the snapshot-file path.
> We can refactor to remove the duplicated logic, improving readability and 
> guaranteeing that the delta file and the snapshot file keep the same 
> structure.






[jira] [Updated] (SPARK-24311) Refactor HDFSBackedStateStoreProvide to remove duplicated logic between operations on delta file and snapshot file

2018-05-17 Thread Jungtaek Lim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-24311:
-
Summary: Refactor HDFSBackedStateStoreProvide to remove duplicated logic 
between operations on delta file and snapshot file  (was: Refactor 
HDFSBackedStateStoreProvide to remove duplicated logic between operations on 
delta file and operations on snapshot file)

> Refactor HDFSBackedStateStoreProvide to remove duplicated logic between 
> operations on delta file and snapshot file
> --
>
> Key: SPARK-24311
> URL: https://issues.apache.org/jira/browse/SPARK-24311
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> The structure of the delta file and the snapshot file is the same, but the 
> operations on them are implemented as separate methods, duplicating logic 
> between the delta-file path and the snapshot-file path.
> We can refactor to remove the duplicated logic, improving readability and 
> guaranteeing that the two file structures stay the same.






[jira] [Assigned] (SPARK-24309) AsyncEventQueue should handle an interrupt from a Listener

2018-05-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24309:


Assignee: (was: Apache Spark)

> AsyncEventQueue should handle an interrupt from a Listener
> --
>
> Key: SPARK-24309
> URL: https://issues.apache.org/jira/browse/SPARK-24309
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Major
>
> AsyncEventQueue does not properly handle an interrupt from a Listener -- the 
> spark app won't even stop!
> I observed this on an actual workload as the EventLoggingListener can 
> generate an interrupt from the underlying hdfs calls:
> {noformat}
> 18/05/16 17:46:36 WARN hdfs.DFSClient: Error transferring data from 
> DatanodeInfoWithStorage[10.17.206.36:20002,DS-3adac910-5d0a-418b-b0f7-6332b35bf6a1,DISK]
>  to 
> DatanodeInfoWithStorage[10.17.206.42:20002,DS-2e7ed0aa-0e68-441e-b5b2-96ad4a9ce7a5,DISK]:
>  10 millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/10.17.206.35:33950 
> remote=/10.17.206.36:20002]
> 18/05/16 17:46:36 WARN hdfs.DFSClient: DataStreamer Exception
> java.net.SocketTimeoutException: 10 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/10.17.206.35:33950 remote=/10.17.206.36:20002]
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)
> at 
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2305)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$StreamerStreams.sendTransferBlock(DFSOutputStream.java:516)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1450)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1408)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1559)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1254)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:739)
> 18/05/16 17:46:36 ERROR scheduler.AsyncEventQueue: Listener 
> EventLoggingListener threw an exception
> [... a few more of these ...]
> 18/05/16 17:46:36 INFO scheduler.AsyncEventQueue: Stopping listener queue 
> eventLog.
> java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
> at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:94)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:83)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:79)
> at 
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:78)
> {noformat}
> When this happens, the AsyncEventQueue will continue to pile up events in its 
> queue, though it's no longer processing them.  And then in the call to stop, 
> it'll block on {{queue.put(POISON_PILL)}} forever, so the SparkContext won't 
> stop.






[jira] [Assigned] (SPARK-24309) AsyncEventQueue should handle an interrupt from a Listener

2018-05-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24309:


Assignee: Apache Spark

> AsyncEventQueue should handle an interrupt from a Listener
> --
>
> Key: SPARK-24309
> URL: https://issues.apache.org/jira/browse/SPARK-24309
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>Priority: Major
>
> AsyncEventQueue does not properly handle an interrupt from a Listener -- the 
> spark app won't even stop!
> I observed this on an actual workload as the EventLoggingListener can 
> generate an interrupt from the underlying hdfs calls:
> {noformat}
> 18/05/16 17:46:36 WARN hdfs.DFSClient: Error transferring data from 
> DatanodeInfoWithStorage[10.17.206.36:20002,DS-3adac910-5d0a-418b-b0f7-6332b35bf6a1,DISK]
>  to 
> DatanodeInfoWithStorage[10.17.206.42:20002,DS-2e7ed0aa-0e68-441e-b5b2-96ad4a9ce7a5,DISK]:
>  10 millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/10.17.206.35:33950 
> remote=/10.17.206.36:20002]
> 18/05/16 17:46:36 WARN hdfs.DFSClient: DataStreamer Exception
> java.net.SocketTimeoutException: 10 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/10.17.206.35:33950 remote=/10.17.206.36:20002]
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)
> at 
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2305)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$StreamerStreams.sendTransferBlock(DFSOutputStream.java:516)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1450)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1408)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1559)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1254)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:739)
> 18/05/16 17:46:36 ERROR scheduler.AsyncEventQueue: Listener 
> EventLoggingListener threw an exception
> [... a few more of these ...]
> 18/05/16 17:46:36 INFO scheduler.AsyncEventQueue: Stopping listener queue 
> eventLog.
> java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
> at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:94)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:83)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:79)
> at 
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:78)
> {noformat}
> When this happens, the AsyncEventQueue will continue to pile up events in its 
> queue, though it's no longer processing them.  And then in the call to stop, 
> it'll block on {{queue.put(POISON_PILL)}} forever, so the SparkContext won't 
> stop.






[jira] [Commented] (SPARK-24309) AsyncEventQueue should handle an interrupt from a Listener

2018-05-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479688#comment-16479688
 ] 

Apache Spark commented on SPARK-24309:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/21356

> AsyncEventQueue should handle an interrupt from a Listener
> --
>
> Key: SPARK-24309
> URL: https://issues.apache.org/jira/browse/SPARK-24309
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Major
>
> AsyncEventQueue does not properly handle an interrupt from a Listener -- the 
> spark app won't even stop!
> I observed this on an actual workload as the EventLoggingListener can 
> generate an interrupt from the underlying hdfs calls:
> {noformat}
> 18/05/16 17:46:36 WARN hdfs.DFSClient: Error transferring data from 
> DatanodeInfoWithStorage[10.17.206.36:20002,DS-3adac910-5d0a-418b-b0f7-6332b35bf6a1,DISK]
>  to 
> DatanodeInfoWithStorage[10.17.206.42:20002,DS-2e7ed0aa-0e68-441e-b5b2-96ad4a9ce7a5,DISK]:
>  10 millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/10.17.206.35:33950 
> remote=/10.17.206.36:20002]
> 18/05/16 17:46:36 WARN hdfs.DFSClient: DataStreamer Exception
> java.net.SocketTimeoutException: 10 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/10.17.206.35:33950 remote=/10.17.206.36:20002]
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)
> at 
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2305)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$StreamerStreams.sendTransferBlock(DFSOutputStream.java:516)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1450)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1408)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1559)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1254)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:739)
> 18/05/16 17:46:36 ERROR scheduler.AsyncEventQueue: Listener 
> EventLoggingListener threw an exception
> [... a few more of these ...]
> 18/05/16 17:46:36 INFO scheduler.AsyncEventQueue: Stopping listener queue 
> eventLog.
> java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
> at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:94)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:83)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:79)
> at 
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:78)
> {noformat}
> When this happens, the AsyncEventQueue will continue to pile up events in its 
> queue, though it's no longer processing them.  And then in the call to stop, 
> it'll block on {{queue.put(POISON_PILL)}} forever, so the SparkContext won't 
> stop.






[jira] [Commented] (SPARK-24114) improve instrumentation for spark.ml.recommendation

2018-05-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479686#comment-16479686
 ] 

Apache Spark commented on SPARK-24114:
--

User 'MrBago' has created a pull request for this issue:
https://github.com/apache/spark/pull/21344

> improve instrumentation for spark.ml.recommendation
> ---
>
> Key: SPARK-24114
> URL: https://issues.apache.org/jira/browse/SPARK-24114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Priority: Major
>







[jira] [Issue Comment Deleted] (SPARK-24114) improve instrumentation for spark.ml.recommendation

2018-05-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24114:
--
Comment: was deleted

(was: User 'MrBago' has created a pull request for this issue:
https://github.com/apache/spark/pull/21344)

> improve instrumentation for spark.ml.recommendation
> ---
>
> Key: SPARK-24114
> URL: https://issues.apache.org/jira/browse/SPARK-24114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Priority: Major
>







[jira] [Resolved] (SPARK-24310) Instrumentation for frequent pattern mining

2018-05-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-24310.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

> Instrumentation for frequent pattern mining
> ---
>
> Key: SPARK-24310
> URL: https://issues.apache.org/jira/browse/SPARK-24310
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Bago Amirbekian
>Priority: Major
> Fix For: 2.4.0
>
>
> See parent JIRA






[jira] [Commented] (SPARK-24310) Instrumentation for frequent pattern mining

2018-05-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479684#comment-16479684
 ] 

Joseph K. Bradley commented on SPARK-24310:
---

The PR for this was linked to the wrong JIRA, but I'm adding the link here for 
the record.

> Instrumentation for frequent pattern mining
> ---
>
> Key: SPARK-24310
> URL: https://issues.apache.org/jira/browse/SPARK-24310
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Bago Amirbekian
>Priority: Major
> Fix For: 2.4.0
>
>
> See parent JIRA






[jira] [Created] (SPARK-24310) Instrumentation for frequent pattern mining

2018-05-17 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-24310:
-

 Summary: Instrumentation for frequent pattern mining
 Key: SPARK-24310
 URL: https://issues.apache.org/jira/browse/SPARK-24310
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.3.0
Reporter: Joseph K. Bradley
Assignee: Bago Amirbekian


See parent JIRA






[jira] [Assigned] (SPARK-24114) improve instrumentation for spark.ml.recommendation

2018-05-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-24114:
-

Assignee: (was: Bago Amirbekian)

> improve instrumentation for spark.ml.recommendation
> ---
>
> Key: SPARK-24114
> URL: https://issues.apache.org/jira/browse/SPARK-24114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Priority: Major
>







[jira] [Updated] (SPARK-24114) improve instrumentation for spark.ml.recommendation

2018-05-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24114:
--
Shepherd:   (was: Joseph K. Bradley)

> improve instrumentation for spark.ml.recommendation
> ---
>
> Key: SPARK-24114
> URL: https://issues.apache.org/jira/browse/SPARK-24114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Priority: Major
>







[jira] [Created] (SPARK-24309) AsyncEventQueue should handle an interrupt from a Listener

2018-05-17 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-24309:


 Summary: AsyncEventQueue should handle an interrupt from a Listener
 Key: SPARK-24309
 URL: https://issues.apache.org/jira/browse/SPARK-24309
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core
Affects Versions: 2.3.0
Reporter: Imran Rashid


AsyncEventQueue does not properly handle an interrupt from a Listener -- the 
spark app won't even stop!

I observed this on an actual workload as the EventLoggingListener can generate 
an interrupt from the underlying hdfs calls:

{noformat}
18/05/16 17:46:36 WARN hdfs.DFSClient: Error transferring data from 
DatanodeInfoWithStorage[10.17.206.36:20002,DS-3adac910-5d0a-418b-b0f7-6332b35bf6a1,DISK]
 to 
DatanodeInfoWithStorage[10.17.206.42:20002,DS-2e7ed0aa-0e68-441e-b5b2-96ad4a9ce7a5,DISK]:
 10 millis timeout while waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/10.17.206.35:33950 
remote=/10.17.206.36:20002]
18/05/16 17:46:36 WARN hdfs.DFSClient: DataStreamer Exception
java.net.SocketTimeoutException: 10 millis timeout while waiting for 
channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/10.17.206.35:33950 remote=/10.17.206.36:20002]
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at 
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2305)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$StreamerStreams.sendTransferBlock(DFSOutputStream.java:516)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1450)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1408)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1559)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1254)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:739)
18/05/16 17:46:36 ERROR scheduler.AsyncEventQueue: Listener 
EventLoggingListener threw an exception
[... a few more of these ...]
18/05/16 17:46:36 INFO scheduler.AsyncEventQueue: Stopping listener queue 
eventLog.
java.lang.InterruptedException
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
at 
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
at 
org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:94)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:83)
at 
org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:79)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
at 
org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:78)
{noformat}

When this happens, the AsyncEventQueue will continue to pile up events in its 
queue, though it's no longer processing them.  And then in the call to stop, 
it'll block on {{queue.put(POISON_PILL)}} forever, so the SparkContext won't 
stop.
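A minimal sketch of the behavior a fix would need: the dispatch loop treats an interrupt as a stop signal, marking the queue stopped and draining it, so producers stop piling up events and a later {{stop()}} can never wedge on a full queue. This is an illustrative toy, not Spark's actual AsyncEventQueue; all names are invented, and event handling is collapsed to a single-threaded loop for clarity.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

public class SimpleEventQueue {
    // Tiny bounded queue so the "producers wedge on a full queue" hazard is real.
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(2);
    private final AtomicBoolean stopped = new AtomicBoolean(false);

    // Producers drop events once the queue has stopped, instead of piling up.
    public boolean post(String event) {
        return !stopped.get() && queue.offer(event);
    }

    // The dispatch loop treats an interrupt (raised by a listener or by a
    // blocking take()) as a stop signal: mark stopped, drain the queue so no
    // producer can block on it, and preserve the thread's interrupt status.
    public void dispatchLoop(Consumer<String> handler) {
        try {
            while (!stopped.get()) {
                handler.accept(queue.take());
            }
        } catch (InterruptedException e) {
            stopped.set(true);
            queue.clear();
            Thread.currentThread().interrupt();
        }
    }
}
```

In the sketch, a listener that interrupts the dispatch thread causes the next `take()` to throw `InterruptedException`, after which the queue refuses new events rather than accumulating them forever.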






[jira] [Updated] (SPARK-24114) improve instrumentation for spark.ml.recommendation

2018-05-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24114:
--
Shepherd: Joseph K. Bradley

> improve instrumentation for spark.ml.recommendation
> ---
>
> Key: SPARK-24114
> URL: https://issues.apache.org/jira/browse/SPARK-24114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Assignee: Bago Amirbekian
>Priority: Major
>







[jira] [Assigned] (SPARK-24114) improve instrumentation for spark.ml.recommendation

2018-05-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-24114:
-

Assignee: Bago Amirbekian

> improve instrumentation for spark.ml.recommendation
> ---
>
> Key: SPARK-24114
> URL: https://issues.apache.org/jira/browse/SPARK-24114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Assignee: Bago Amirbekian
>Priority: Major
>







[jira] [Commented] (SPARK-6238) Support shuffle where individual blocks might be > 2G

2018-05-17 Thread William Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479645#comment-16479645
 ] 

William Shen commented on SPARK-6238:
-

[~irashid], you marked this as a duplicate. Can you also mark which ticket this 
duplicates? I see that you marked this as duplicated by SPARK-5928. Did you 
mean that this duplicates SPARK-5928? (or if this duplicates SPARK-19659 per 
your comment). Thanks in advance for the clarification!

> Support shuffle where individual blocks might be > 2G
> -
>
> Key: SPARK-6238
> URL: https://issues.apache.org/jira/browse/SPARK-6238
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: jin xing
>Priority: Major
>







[jira] [Commented] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode

2018-05-17 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479610#comment-16479610
 ] 

Marcelo Vanzin commented on SPARK-21945:


Let me run some tests and if they're ok, I'll cherry pick.

> pyspark --py-files doesn't work in yarn client mode
> ---
>
> Key: SPARK-21945
> URL: https://issues.apache.org/jira/browse/SPARK-21945
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> I tried running pyspark with --py-files pythonfiles.zip  but it doesn't 
> properly add the zip file to the PYTHONPATH.
> I can work around by exporting PYTHONPATH.
> Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand  I don't see 
> this supported at all.   If that is the case perhaps it should be moved to 
> improvement.
> Note it works via spark-submit in both client and cluster mode to run python 
> script.






[jira] [Commented] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode

2018-05-17 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479574#comment-16479574
 ] 

Dongjoon Hyun commented on SPARK-21945:
---

Hi, [~vanzin], [~tgraves], [~hyukjin.kwon].
I'm wondering if we can have this at 2.3.1 RC2, too? Or, at least, `branch-2.3` 
sometime?

> pyspark --py-files doesn't work in yarn client mode
> ---
>
> Key: SPARK-21945
> URL: https://issues.apache.org/jira/browse/SPARK-21945
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> I tried running pyspark with --py-files pythonfiles.zip  but it doesn't 
> properly add the zip file to the PYTHONPATH.
> I can work around by exporting PYTHONPATH.
> Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand  I don't see 
> this supported at all.   If that is the case perhaps it should be moved to 
> improvement.
> Note it works via spark-submit in both client and cluster mode to run python 
> script.






[jira] [Resolved] (SPARK-23195) Hint of cached data is lost

2018-05-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23195.
-
Resolution: Won't Fix

> Hint of cached data is lost
> ---
>
> Key: SPARK-23195
> URL: https://issues.apache.org/jira/browse/SPARK-23195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> {noformat}
> withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
>   val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
> "value")
>   val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
> "value")
>   broadcast(df2).cache()
>   df2.collect()
>   val df3 = df1.join(df2, Seq("key"), "inner")
>   val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
> case b: BroadcastHashJoinExec => b
>   }.size
>   assert(numBroadCastHashJoin === 1)
> }
> {noformat}
> The broadcast hint is not respected.






[jira] [Updated] (SPARK-23571) Delete auxiliary Kubernetes resources upon application completion

2018-05-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-23571:
---
Target Version/s:   (was: 2.3.1)

> Delete auxiliary Kubernetes resources upon application completion
> -
>
> Key: SPARK-23571
> URL: https://issues.apache.org/jira/browse/SPARK-23571
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Yinan Li
>Priority: Major
>







[jira] [Commented] (SPARK-23571) Delete auxiliary Kubernetes resources upon application completion

2018-05-17 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479541#comment-16479541
 ] 

Marcelo Vanzin commented on SPARK-23571:


I removed the target version since the PR seems stalled (and doesn't even merge 
anymore).

> Delete auxiliary Kubernetes resources upon application completion
> -
>
> Key: SPARK-23571
> URL: https://issues.apache.org/jira/browse/SPARK-23571
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Yinan Li
>Priority: Major
>







[jira] [Commented] (SPARK-23486) LookupFunctions should not check the same function name more than once

2018-05-17 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479539#comment-16479539
 ] 

Marcelo Vanzin commented on SPARK-23486:


I removed the target versions since the PR seems stalled in review.

> LookupFunctions should not check the same function name more than once
> --
>
> Key: SPARK-23486
> URL: https://issues.apache.org/jira/browse/SPARK-23486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Cheng Lian
>Priority: Major
>  Labels: starter
>
> For a query invoking the same function multiple times, the current 
> {{LookupFunctions}} rule performs a check for each invocation. For users 
> using Hive metastore as external catalog, this issues unnecessary metastore 
> accesses and can slow down the analysis phase quite a bit.
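The fix this description calls for, consulting the catalog at most once per distinct function name, can be sketched with a plain seen-set over the invocation list. This is an illustrative model only, not Spark's actual `LookupFunctions` rule: `checkNaive` and `checkDeduped` are hypothetical helpers, and the returned counts stand in for metastore round trips.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: check each distinct function name against the external catalog
// at most once. The returned int counts how many (expensive) catalog
// accesses each strategy would perform; no real metastore is involved.
public class LookupDedupSketch {

    // Naive rule: one catalog access per invocation, repeats included.
    static int checkNaive(List<String> invocations) {
        int catalogCalls = 0;
        for (String name : invocations) {
            catalogCalls++;          // models one metastore access per call site
        }
        return catalogCalls;
    }

    // Deduplicated rule: a seen-set limits lookups to distinct names.
    static int checkDeduped(List<String> invocations) {
        Set<String> seen = new HashSet<>();
        int catalogCalls = 0;
        for (String name : invocations) {
            if (seen.add(name)) {
                catalogCalls++;      // only the first occurrence hits the catalog
            }
        }
        return catalogCalls;
    }

    public static void main(String[] args) {
        List<String> query = List.of("upper", "upper", "lower", "upper");
        System.out.println("naive=" + checkNaive(query)
                + " deduped=" + checkDeduped(query)); // naive=4 deduped=2
    }
}
```

For a query invoking the same function many times, the savings scale with the number of repeats, which is exactly the slowdown the report describes for Hive-metastore-backed catalogs.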






[jira] [Updated] (SPARK-23486) LookupFunctions should not check the same function name more than once

2018-05-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-23486:
---
Target Version/s:   (was: 2.3.1, 2.4.0)

> LookupFunctions should not check the same function name more than once
> --
>
> Key: SPARK-23486
> URL: https://issues.apache.org/jira/browse/SPARK-23486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Cheng Lian
>Priority: Major
>  Labels: starter
>
> For a query invoking the same function multiple times, the current 
> {{LookupFunctions}} rule performs a check for each invocation. For users 
> using Hive metastore as external catalog, this issues unnecessary metastore 
> accesses and can slow down the analysis phase quite a bit.






[jira] [Commented] (SPARK-23195) Hint of cached data is lost

2018-05-17 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479535#comment-16479535
 ] 

Marcelo Vanzin commented on SPARK-23195:


[~smilegator] doesn't look like you're actively working on this so I changed 
the bug's state and removed the target version.

> Hint of cached data is lost
> ---
>
> Key: SPARK-23195
> URL: https://issues.apache.org/jira/browse/SPARK-23195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> {noformat}
> withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
>   val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
> "value")
>   val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
> "value")
>   broadcast(df2).cache()
>   df2.collect()
>   val df3 = df1.join(df2, Seq("key"), "inner")
>   val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
> case b: BroadcastHashJoinExec => b
>   }.size
>   assert(numBroadCastHashJoin === 1)
> }
> {noformat}
> The broadcast hint is not respected.






[jira] [Updated] (SPARK-23195) Hint of cached data is lost

2018-05-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-23195:
---
Target Version/s:   (was: 2.3.1)

> Hint of cached data is lost
> ---
>
> Key: SPARK-23195
> URL: https://issues.apache.org/jira/browse/SPARK-23195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Xiao Li
>Assignee: Marcelo Vanzin
>Priority: Major
>
> {noformat}
> withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
>   val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
> "value")
>   val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
> "value")
>   broadcast(df2).cache()
>   df2.collect()
>   val df3 = df1.join(df2, Seq("key"), "inner")
>   val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
> case b: BroadcastHashJoinExec => b
>   }.size
>   assert(numBroadCastHashJoin === 1)
> }
> {noformat}
> The broadcast hint is not respected.






[jira] [Assigned] (SPARK-23195) Hint of cached data is lost

2018-05-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23195:
--

Assignee: Marcelo Vanzin  (was: Xiao Li)

> Hint of cached data is lost
> ---
>
> Key: SPARK-23195
> URL: https://issues.apache.org/jira/browse/SPARK-23195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Xiao Li
>Assignee: Marcelo Vanzin
>Priority: Major
>
> {noformat}
> withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
>   val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
> "value")
>   val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
> "value")
>   broadcast(df2).cache()
>   df2.collect()
>   val df3 = df1.join(df2, Seq("key"), "inner")
>   val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
> case b: BroadcastHashJoinExec => b
>   }.size
>   assert(numBroadCastHashJoin === 1)
> }
> {noformat}
> The broadcast hint is not respected.






[jira] [Assigned] (SPARK-23195) Hint of cached data is lost

2018-05-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23195:
--

Assignee: (was: Marcelo Vanzin)

> Hint of cached data is lost
> ---
>
> Key: SPARK-23195
> URL: https://issues.apache.org/jira/browse/SPARK-23195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> {noformat}
> withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
>   val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
> "value")
>   val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
> "value")
>   broadcast(df2).cache()
>   df2.collect()
>   val df3 = df1.join(df2, Seq("key"), "inner")
>   val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
> case b: BroadcastHashJoinExec => b
>   }.size
>   assert(numBroadCastHashJoin === 1)
> }
> {noformat}
> The broadcast hint is not respected.






[jira] [Resolved] (SPARK-24115) improve instrumentation for spark.ml.tuning

2018-05-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-24115.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21340
[https://github.com/apache/spark/pull/21340]

> improve instrumentation for spark.ml.tuning
> ---
>
> Key: SPARK-24115
> URL: https://issues.apache.org/jira/browse/SPARK-24115
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Assigned] (SPARK-24115) improve instrumentation for spark.ml.tuning

2018-05-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24115:
-

Assignee: Bago Amirbekian

> improve instrumentation for spark.ml.tuning
> ---
>
> Key: SPARK-24115
> URL: https://issues.apache.org/jira/browse/SPARK-24115
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Assignee: Bago Amirbekian
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Assigned] (SPARK-24308) Handle DataReaderFactory to InputPartition renames in left over classes

2018-05-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24308:


Assignee: (was: Apache Spark)

> Handle DataReaderFactory to InputPartition renames in left over classes
> ---
>
> Key: SPARK-24308
> URL: https://issues.apache.org/jira/browse/SPARK-24308
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Priority: Major
>
> SPARK-24073 renames DataReaderFactory -> InputPartition and DataReader -> 
> InputPartitionReader. Some classes still reflect the old names, which causes 
> confusion.






[jira] [Assigned] (SPARK-24308) Handle DataReaderFactory to InputPartition renames in left over classes

2018-05-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24308:


Assignee: Apache Spark

> Handle DataReaderFactory to InputPartition renames in left over classes
> ---
>
> Key: SPARK-24308
> URL: https://issues.apache.org/jira/browse/SPARK-24308
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-24073 renames DataReaderFactory -> InputPartition and DataReader -> 
> InputPartitionReader. Some classes still reflect the old names, which causes 
> confusion.






[jira] [Commented] (SPARK-24308) Handle DataReaderFactory to InputPartition renames in left over classes

2018-05-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479485#comment-16479485
 ] 

Apache Spark commented on SPARK-24308:
--

User 'arunmahadevan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21355

> Handle DataReaderFactory to InputPartition renames in left over classes
> ---
>
> Key: SPARK-24308
> URL: https://issues.apache.org/jira/browse/SPARK-24308
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Priority: Major
>
> SPARK-24073 renames DataReaderFactory -> InputPartition and DataReader -> 
> InputPartitionReader. Some classes still reflect the old names, which causes 
> confusion.






[jira] [Created] (SPARK-24308) Handle DataReaderFactory to InputPartition renames in left over classes

2018-05-17 Thread Arun Mahadevan (JIRA)
Arun Mahadevan created SPARK-24308:
--

 Summary: Handle DataReaderFactory to InputPartition renames in 
left over classes
 Key: SPARK-24308
 URL: https://issues.apache.org/jira/browse/SPARK-24308
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Arun Mahadevan


SPARK-24073 renames DataReaderFactory -> InputPartition and DataReader -> 
InputPartitionReader. Some classes still reflect the old names, which causes 
confusion.






[jira] [Commented] (SPARK-24307) Support sending messages over 2GB from memory

2018-05-17 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479439#comment-16479439
 ] 

Imran Rashid commented on SPARK-24307:
--

I have a really hacky version of this now; I plan to clean it up and post a PR 
within a few days.

> Support sending messages over 2GB from memory
> -
>
> Key: SPARK-24307
> URL: https://issues.apache.org/jira/browse/SPARK-24307
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Major
>
> Spark's networking layer supports sending messages backed by a {{FileRegion}} 
> or a {{ByteBuf}}. Sending large FileRegions works, as netty supports them. 
> However, {{ByteBuf}} is limited to 2 GB. This is particularly a problem for 
> sending large datasets that are already in memory, e.g. cached RDD blocks.
> For example, if you try to replicate a block stored in memory that is over 
> 2 GB, you will see an exception like:
> {noformat}
> 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC 
> 7420542363232096629 to xyz.com/172.31.113.213:44358: 
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723)
> at 
> io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
> at 
> io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: 
> -1294617291 (expected: 0 <= readerIndex <= writerIndex <= 
> capacity(-1294617291))
> at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129)
> at 
> io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688)
> at io.netty.buffer.CompositeByteBuf.<init>(CompositeByteBuf.java:110)
> at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359)
> at 
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87)
> at 
> org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95)
> at 
> org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
> ... 17 more
> {noformat}
> A simple solution to this is to create a "FileRegion" which is backed by a 
> {{ChunkedByteBuffer}} (Spark's existing data structure to support blocks > 
> 2 GB in memory).
>  A drawback to this approach is that 

[jira] [Commented] (SPARK-24283) Make standard scaler work without legacy MLlib

2018-05-17 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479405#comment-16479405
 ] 

Huaxin Gao commented on SPARK-24283:


[~holdenk] Hi Holden, all I need to do is to change 
{code:java}
val transformer: Vector => Vector = v => 
scaler.transform(OldVectors.fromML(v)).asML{code}
 to
{code:java}
val transformer: Vector => Vector = v => scaler.transform(v).asML{code}
so it will use the implicit conversion, right?

> Make standard scaler work without legacy MLlib
> --
>
> Key: SPARK-24283
> URL: https://issues.apache.org/jira/browse/SPARK-24283
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
>
> Currently StandardScaler converts Spark ML vectors to MLlib vectors during 
> prediction, we should skip that step.






[jira] [Created] (SPARK-24307) Support sending messages over 2GB from memory

2018-05-17 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-24307:


 Summary: Support sending messages over 2GB from memory
 Key: SPARK-24307
 URL: https://issues.apache.org/jira/browse/SPARK-24307
 Project: Spark
  Issue Type: Sub-task
  Components: Block Manager, Spark Core
Affects Versions: 2.3.0
Reporter: Imran Rashid


Spark's networking layer supports sending messages backed by a {{FileRegion}} 
or a {{ByteBuf}}. Sending large FileRegions works, as netty supports them. 
However, {{ByteBuf}} is limited to 2 GB. This is particularly a problem for 
sending large datasets that are already in memory, e.g. cached RDD blocks.

For example, if you try to replicate a block stored in memory that is over 
2 GB, you will see an exception like:

{noformat}
18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC 
7420542363232096629 to xyz.com/172.31.113.213:44358: 
io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
writerIndex <= capacity(-1294617291))
io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
writerIndex <= capacity(-1294617291))
at 
io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
at 
io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816)
at 
io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723)
at 
io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
at 
io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
at 
io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
at 
io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
at 
io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: 
-1294617291 (expected: 0 <= readerIndex <= writerIndex <= capacity(-1294617291))
at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129)
at io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688)
at io.netty.buffer.CompositeByteBuf.<init>(CompositeByteBuf.java:110)
at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359)
at 
org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87)
at 
org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95)
at 
org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52)
at 
org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
at 
org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
at 
io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
... 17 more

{noformat}

A simple solution to this is to create a "FileRegion" which is backed by a 
{{ChunkedByteBuffer}} (Spark's existing data structure to support blocks > 2 GB 
in memory).

A drawback to this approach is that blocks cached in memory as deserialized 
values would need to have the *entire* block serialized into memory before it 
can be pushed. However, that would involve a larger change to the block manager 
as well, and it is not strictly necessary, so it can be handled separately as a 
performance improvement.
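The chunk-backed "FileRegion" idea can be sketched as plain chunk-wise channel writes: no single buffer ever needs to exceed the 2 GB `ByteBuf` cap, and the running byte count is a `long`. This is a minimal model under stated assumptions, not Spark's or netty's API; `sendChunks` is a hypothetical helper, and the demo chunks are tiny so it runs instantly.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.util.List;

// Sketch: push a large payload as a sequence of sub-2GB chunks, the way a
// FileRegion backed by ChunkedByteBuffer would, instead of wrapping the
// whole payload in one buffer capped at Integer.MAX_VALUE bytes.
public class ChunkedSendSketch {

    static long sendChunks(List<ByteBuffer> chunks, WritableByteChannel channel)
            throws IOException {
        long written = 0L;                   // long total, so > 2 GB is fine
        for (ByteBuffer chunk : chunks) {
            while (chunk.hasRemaining()) {   // a chunk may need several writes
                written += channel.write(chunk);
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        WritableByteChannel channel = Channels.newChannel(out);
        long total = sendChunks(List.of(
                ByteBuffer.wrap("spark-".getBytes()),
                ByteBuffer.wrap("chunks".getBytes())), channel);
        System.out.println(total + " bytes: " + out); // 12 bytes: spark-chunks
    }
}
```

The design point is that only per-chunk sizes are bounded by `int`; the aggregate transfer size is tracked in `long`, mirroring how netty's `FileRegion.transferred()` escapes the `ByteBuf` limit.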




[jira] [Updated] (SPARK-24285) Flaky test: ContinuousSuite.query without test harness

2018-05-17 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24285:
--
Description: 
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/373/

  was:- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/


> Flaky test: ContinuousSuite.query without test harness
> --
>
> Key: SPARK-24285
> URL: https://issues.apache.org/jira/browse/SPARK-24285
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/370/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/373/






[jira] [Commented] (SPARK-24304) Scheduler changes for continuous processing shuffle support

2018-05-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479133#comment-16479133
 ] 

Apache Spark commented on SPARK-24304:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/21353

> Scheduler changes for continuous processing shuffle support
> ---
>
> Key: SPARK-24304
> URL: https://issues.apache.org/jira/browse/SPARK-24304
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Li Yuanjian
>Priority: Major
>
> Both docs include the required scheduler changes; adding a new JIRA here to 
> collect our comments:
> https://docs.google.com/document/d/14cGJ75v9myznywtB35ytEqL9wHy9xfZRv06B6g2tUgI/edit#bookmark=id.tjcvevpyuv4x
> https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit?disco=B4z058g






[jira] [Assigned] (SPARK-24304) Scheduler changes for continuous processing shuffle support

2018-05-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24304:


Assignee: (was: Apache Spark)

> Scheduler changes for continuous processing shuffle support
> ---
>
> Key: SPARK-24304
> URL: https://issues.apache.org/jira/browse/SPARK-24304
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Li Yuanjian
>Priority: Major
>
> Both docs include the required scheduler changes; adding a new JIRA here to 
> collect our comments:
> https://docs.google.com/document/d/14cGJ75v9myznywtB35ytEqL9wHy9xfZRv06B6g2tUgI/edit#bookmark=id.tjcvevpyuv4x
> https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit?disco=B4z058g






[jira] [Assigned] (SPARK-24304) Scheduler changes for continuous processing shuffle support

2018-05-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24304:


Assignee: Apache Spark

> Scheduler changes for continuous processing shuffle support
> ---
>
> Key: SPARK-24304
> URL: https://issues.apache.org/jira/browse/SPARK-24304
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Li Yuanjian
>Assignee: Apache Spark
>Priority: Major
>
> Both docs include the required scheduler changes; adding a new JIRA here to 
> collect our comments:
> https://docs.google.com/document/d/14cGJ75v9myznywtB35ytEqL9wHy9xfZRv06B6g2tUgI/edit#bookmark=id.tjcvevpyuv4x
> https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit?disco=B4z058g






[jira] [Assigned] (SPARK-24193) Sort by disk when number of limit is big in TakeOrderedAndProjectExec

2018-05-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24193:
---

Assignee: jin xing

> Sort by disk when number of limit is big in TakeOrderedAndProjectExec
> -
>
> Key: SPARK-24193
> URL: https://issues.apache.org/jira/browse/SPARK-24193
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Major
> Fix For: 2.4.0
>
>
> Physical plan of  "_select colA from t order by colB limit M_" is 
> _TakeOrderedAndProject_;
> Currently _TakeOrderedAndProject_ sorts data in memory, see 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L158
>  
> Shall we add a config so that, if the limit (M) is too big, we sort on 
> disk? That would resolve the memory issue.
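The sort-by-disk idea reads as a classic external sort: sort fixed-size runs in memory, spill each run to disk, then merge the runs while counting out the first M rows. The sketch below is a minimal model under those assumptions; `topM`, the text-file run format, and the fixed run size are all hypothetical, not Spark's `TakeOrderedAndProjectExec`.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: disk-backed ordered limit. Memory holds at most one run during
// the spill phase and one head row per run during the merge phase.
public class DiskTopMSketch {

    static List<Integer> topM(List<Integer> rows, int m, int runSize)
            throws IOException {
        // 1) Sort fixed-size runs in memory and spill each to a temp file.
        List<File> runs = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += runSize) {
            List<Integer> run = new ArrayList<>(
                    rows.subList(i, Math.min(i + runSize, rows.size())));
            Collections.sort(run);
            List<String> lines = new ArrayList<>();
            for (int v : run) lines.add(Integer.toString(v));
            File f = File.createTempFile("run", ".txt");
            f.deleteOnExit();
            Files.write(f.toPath(), lines);
            runs.add(f);
        }
        // 2) Merge the sorted runs, holding one "head" row per run, and
        //    stop as soon as M rows have been emitted.
        List<BufferedReader> readers = new ArrayList<>();
        List<Integer> heads = new ArrayList<>();
        for (File f : runs) {
            BufferedReader r = new BufferedReader(new FileReader(f));
            readers.add(r);
            String line = r.readLine();
            heads.add(line == null ? null : Integer.valueOf(line));
        }
        List<Integer> out = new ArrayList<>();
        while (out.size() < m) {
            int min = -1;                      // index of the smallest head
            for (int i = 0; i < heads.size(); i++) {
                if (heads.get(i) != null
                        && (min < 0 || heads.get(i) < heads.get(min))) {
                    min = i;
                }
            }
            if (min < 0) break;                // all runs exhausted
            out.add(heads.get(min));
            String line = readers.get(min).readLine();
            heads.set(min, line == null ? null : Integer.valueOf(line));
        }
        for (BufferedReader r : readers) r.close();
        return out;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(topM(List.of(5, 1, 4, 2, 9, 3), 4, 2));
        // [1, 2, 3, 4]
    }
}
```

A config gating this path on M (as the description suggests) would let small limits keep the fast in-memory sort while large limits trade speed for bounded memory.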






[jira] [Resolved] (SPARK-24193) Sort by disk when number of limit is big in TakeOrderedAndProjectExec

2018-05-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24193.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21252
[https://github.com/apache/spark/pull/21252]

> Sort by disk when number of limit is big in TakeOrderedAndProjectExec
> -
>
> Key: SPARK-24193
> URL: https://issues.apache.org/jira/browse/SPARK-24193
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Major
> Fix For: 2.4.0
>
>
> Physical plan of  "_select colA from t order by colB limit M_" is 
> _TakeOrderedAndProject_;
> Currently _TakeOrderedAndProject_ sorts data in memory, see 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L158
>  
> Shall we add a config so that, if the limit (M) is too big, we sort on 
> disk? That would resolve the memory issue.






[jira] [Updated] (SPARK-24002) Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes

2018-05-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24002:

Fix Version/s: 2.3.1

> Task not serializable caused by 
> org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes
> 
>
> Key: SPARK-24002
> URL: https://issues.apache.org/jira/browse/SPARK-24002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> We have two queries: a 1000-line SQL query and a 3000-line SQL query. It 
> takes at least an hour of running with a heavy write workload to reproduce this once. 
> {code}
> Py4JJavaError: An error occurred while calling o153.sql.
> : org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:189)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$59.apply(Dataset.scala:3021)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:127)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3020)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:646)
>   at sun.reflect.GeneratedMethodAccessor153.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
>   at py4j.Gateway.invoke(Gateway.java:293)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:226)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.spark.SparkException: Exception thrown in Future.get: 
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:190)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:267)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doConsume(BroadcastNestedLoopJoinExec.scala:530)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
>   at 
> org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:37)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:69)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
>   at 
> org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:144)
>   ...
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190)
>   ... 23 more
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.spark.SparkException: Task not serializable
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:179)
>   ... 276 more
> Caused by: org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2380)
>   at 
> 

[jira] [Commented] (SPARK-24036) Stateful operators in continuous processing

2018-05-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479122#comment-16479122
 ] 

Apache Spark commented on SPARK-24036:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/21353

> Stateful operators in continuous processing
> ---
>
> Key: SPARK-24036
> URL: https://issues.apache.org/jira/browse/SPARK-24036
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> The first iteration of continuous processing in Spark 2.3 does not work with 
> stateful operators.






[jira] [Assigned] (SPARK-24036) Stateful operators in continuous processing

2018-05-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24036:


Assignee: (was: Apache Spark)

> Stateful operators in continuous processing
> ---
>
> Key: SPARK-24036
> URL: https://issues.apache.org/jira/browse/SPARK-24036
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> The first iteration of continuous processing in Spark 2.3 does not work with 
> stateful operators.






[jira] [Assigned] (SPARK-24036) Stateful operators in continuous processing

2018-05-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24036:


Assignee: Apache Spark

> Stateful operators in continuous processing
> ---
>
> Key: SPARK-24036
> URL: https://issues.apache.org/jira/browse/SPARK-24036
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Major
>
> The first iteration of continuous processing in Spark 2.3 does not work with 
> stateful operators.






[jira] [Updated] (SPARK-24306) Sort a Dataset with a lambda (like RDD.sortBy)

2018-05-17 Thread Al M (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M updated SPARK-24306:
-
Summary: Sort a Dataset with a lambda (like RDD.sortBy)  (was: Sort a 
Dataset with a lambda (like RDD.sortBy() ))

> Sort a Dataset with a lambda (like RDD.sortBy)
> --
>
> Key: SPARK-24306
> URL: https://issues.apache.org/jira/browse/SPARK-24306
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Al M
>Priority: Minor
>
> Dataset has many useful functions that do not require hard-coded column names 
> (e.g. flatMapGroups, map). Currently the sort() function requires us to pass 
> the column name as a string. 
> It would be nice to have a function similar to sortBy on RDD, where I can 
> define the keys in a lambda, e.g.
> {code:java}
> ds.sortBy(record => record.id){code}






[jira] [Updated] (SPARK-24306) Sort a Dataset with a lambda (like RDD.sortBy() )

2018-05-17 Thread Al M (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M updated SPARK-24306:
-
Summary: Sort a Dataset with a lambda (like RDD.sortBy() )  (was: Sort a 
Dataset with a lambda (like RDD.sortBy())

> Sort a Dataset with a lambda (like RDD.sortBy() )
> -
>
> Key: SPARK-24306
> URL: https://issues.apache.org/jira/browse/SPARK-24306
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Al M
>Priority: Minor
>
> Dataset has many useful functions that do not require hard-coded column names 
> (e.g. flatMapGroups, map). Currently the sort() function requires us to pass 
> the column name as a string. 
> It would be nice to have a function similar to sortBy on RDD, where I can 
> define the keys in a lambda, e.g.
> {code:java}
> ds.sortBy(record => record.id){code}






[jira] [Assigned] (SPARK-23922) High-order function: arrays_overlap(x, y) → boolean

2018-05-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23922:
---

Assignee: Marco Gaido

> High-order function: arrays_overlap(x, y) → boolean
> ---
>
> Key: SPARK-23922
> URL: https://issues.apache.org/jira/browse/SPARK-23922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> Tests if arrays x and y have any non-null elements in common. Returns 
> null if there are no non-null elements in common but either array contains 
> null.
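Those three-valued semantics can be pinned down with a small reference implementation in plain Python (None stands in for SQL NULL; this is illustrative, not Spark's actual codegen):

```python
def arrays_overlap(x, y):
    # True  -> the arrays share at least one non-null element
    # None  -> no shared non-null element, but a null is present
    #          (either a null array or a null element), so the
    #          answer is unknown
    # False -> definitely no overlap
    if x is None or y is None:
        return None
    xs = set(v for v in x if v is not None)
    ys = set(v for v in y if v is not None)
    if xs & ys:
        return True
    if None in x or None in y:
        return None
    return False
```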






[jira] [Resolved] (SPARK-23922) High-order function: arrays_overlap(x, y) → boolean

2018-05-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23922.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21028
[https://github.com/apache/spark/pull/21028]

> High-order function: arrays_overlap(x, y) → boolean
> ---
>
> Key: SPARK-23922
> URL: https://issues.apache.org/jira/browse/SPARK-23922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> Tests if arrays x and y have any non-null elements in common. Returns 
> null if there are no non-null elements in common but either array contains 
> null.






[jira] [Created] (SPARK-24306) Sort a Dataset with a lambda (like RDD.sortBy()

2018-05-17 Thread Al M (JIRA)
Al M created SPARK-24306:


 Summary: Sort a Dataset with a lambda (like RDD.sortBy()
 Key: SPARK-24306
 URL: https://issues.apache.org/jira/browse/SPARK-24306
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Al M


Dataset has many useful functions that do not require hard-coded column names 
(e.g. flatMapGroups, map). Currently the sort() function requires us to pass 
the column name as a string. 

It would be nice to have a function similar to sortBy on RDD, where I can 
define the keys in a lambda, e.g.
{code:java}
ds.sortBy(record => record.id){code}
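For comparison, the RDD.sortBy style being requested is ordinary key-function sorting, sketched here in plain Python rather than Spark (illustrative only):

```python
from dataclasses import dataclass

@dataclass
class Record:
    id: int
    name: str

def sort_by(records, key_fn):
    # RDD.sortBy-style: the caller supplies a key function instead of a
    # column-name string, so the key stays type-checked.
    return sorted(records, key=key_fn)

rows = [Record(3, "c"), Record(1, "a"), Record(2, "b")]
by_id = sort_by(rows, lambda r: r.id)
```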






[jira] [Commented] (SPARK-24305) Avoid serialization of private fields in new collection expressions

2018-05-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478946#comment-16478946
 ] 

Apache Spark commented on SPARK-24305:
--

User 'mn-mikke' has created a pull request for this issue:
https://github.com/apache/spark/pull/21352

> Avoid serialization of private fields in new collection expressions
> ---
>
> Key: SPARK-24305
> URL: https://issues.apache.org/jira/browse/SPARK-24305
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Priority: Minor
>
> Make sure that private fields of expression case classes in 
> _collectionOperations.scala_ are not serialized.






[jira] [Assigned] (SPARK-24305) Avoid serialization of private fields in new collection expressions

2018-05-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24305:


Assignee: (was: Apache Spark)

> Avoid serialization of private fields in new collection expressions
> ---
>
> Key: SPARK-24305
> URL: https://issues.apache.org/jira/browse/SPARK-24305
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Priority: Minor
>
> Make sure that private fields of expression case classes in 
> _collectionOperations.scala_ are not serialized.






[jira] [Assigned] (SPARK-24305) Avoid serialization of private fields in new collection expressions

2018-05-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24305:


Assignee: Apache Spark

> Avoid serialization of private fields in new collection expressions
> ---
>
> Key: SPARK-24305
> URL: https://issues.apache.org/jira/browse/SPARK-24305
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Assignee: Apache Spark
>Priority: Minor
>
> Make sure that private fields of expression case classes in 
> _collectionOperations.scala_ are not serialized.






[jira] [Created] (SPARK-24305) Avoid serialization of private fields in new collection expressions

2018-05-17 Thread Marek Novotny (JIRA)
Marek Novotny created SPARK-24305:
-

 Summary: Avoid serialization of private fields in new collection 
expressions
 Key: SPARK-24305
 URL: https://issues.apache.org/jira/browse/SPARK-24305
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Marek Novotny


Make sure that private fields of expression case classes in 
_collectionOperations.scala_ are not serialized.
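In Scala this amounts to marking derived private fields @transient. The same idea sketched in Python via pickle's __getstate__ hook (ArrayUnion is a hypothetical expression class, not Spark code): a lazily computed private cache is dropped before the object is serialized and shipped.

```python
import pickle

class ArrayUnion:
    def __init__(self, left, right):
        self.left = left
        self.right = right
        self._cached = None  # derived state; should never be serialized

    def evaluate(self):
        # Compute the union lazily and cache it.
        if self._cached is None:
            self._cached = sorted(set(self.left) | set(self.right))
        return self._cached

    def __getstate__(self):
        # Exclude the private cache from the pickled payload.
        state = self.__dict__.copy()
        state["_cached"] = None
        return state

expr = ArrayUnion([1, 2], [2, 3])
expr.evaluate()                          # populate the cache
clone = pickle.loads(pickle.dumps(expr)) # cache is not shipped
```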






[jira] [Assigned] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982

2018-05-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22371:
---

Assignee: Artem Rudoy

> dag-scheduler-event-loop thread stopped with error  Attempted to access 
> garbage collected accumulator 5605982
> -
>
> Key: SPARK-22371
> URL: https://issues.apache.org/jira/browse/SPARK-22371
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Mayank Agarwal
>Assignee: Artem Rudoy
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: Helper.scala, ShuffleIssue.java, 
> driver-thread-dump-spark2.1.txt, sampledata
>
>
> Our Spark jobs are getting stuck in DAGScheduler.runJob because the 
> dag-scheduler thread stopped with *Attempted to access garbage collected 
> accumulator 5605982*.
> From our investigation, it looks like the accumulator is cleaned up by GC 
> first, and the same accumulator is then used for merging results from an 
> executor on a task-completion event.
> Since java.lang.IllegalAccessError is a LinkageError, which is treated as a 
> fatal error, the dag-scheduler loop terminates with the exception below.
> ---ERROR stack trace --
> Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 5605982
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> I am attaching the thread dump of the driver as well.
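Spark's AccumulatorContext tracks accumulators through weak references, so once the last strong reference dies the entry can vanish before task completion tries to merge into it. A minimal Python model of that failure mode (illustrative names, not Spark's API):

```python
import gc
import weakref

class Accumulator:
    def __init__(self, acc_id):
        self.id = acc_id
        self.value = 0

# Weak map: holding an accumulator here does not keep it alive.
registry = weakref.WeakValueDictionary()

def register(acc):
    registry[acc.id] = acc

def get(acc_id):
    acc = registry.get(acc_id)
    if acc is None:
        # Analogous to "Attempted to access garbage collected accumulator <id>"
        raise LookupError(f"garbage collected accumulator {acc_id}")
    return acc

acc = Accumulator(5605982)
register(acc)
assert get(5605982) is acc
del acc        # last strong reference gone
gc.collect()   # the weak-map entry disappears with the object
```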






[jira] [Resolved] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982

2018-05-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22371.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21114
[https://github.com/apache/spark/pull/21114]

> dag-scheduler-event-loop thread stopped with error  Attempted to access 
> garbage collected accumulator 5605982
> -
>
> Key: SPARK-22371
> URL: https://issues.apache.org/jira/browse/SPARK-22371
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Mayank Agarwal
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: Helper.scala, ShuffleIssue.java, 
> driver-thread-dump-spark2.1.txt, sampledata
>
>
> Our Spark jobs are getting stuck in DAGScheduler.runJob because the 
> dag-scheduler thread stopped with *Attempted to access garbage collected 
> accumulator 5605982*.
> From our investigation, it looks like the accumulator is cleaned up by GC 
> first, and the same accumulator is then used for merging results from an 
> executor on a task-completion event.
> Since java.lang.IllegalAccessError is a LinkageError, which is treated as a 
> fatal error, the dag-scheduler loop terminates with the exception below.
> ---ERROR stack trace --
> Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 5605982
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> I am attaching the thread dump of the driver as well.






[jira] [Commented] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored

2018-05-17 Thread Sergey Rubtsov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478811#comment-16478811
 ] 

Sergey Rubtsov commented on SPARK-19228:


Java 8 contains the new java.time module; it could also fix an old bug with 
parsing a string into an SQL timestamp value with microsecond accuracy:

https://issues.apache.org/jira/browse/SPARK-10681

> inferSchema function processed csv date column as string and "dateFormat" 
> DataSource option is ignored
> --
>
> Key: SPARK-19228
> URL: https://issues.apache.org/jira/browse/SPARK-19228
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.1.0
>Reporter: Sergey Rubtsov
>Priority: Major
>  Labels: easyfix
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> I need to process user.csv like this:
> {code}
> id,project,started,ended
> sergey.rubtsov,project0,12/12/2012,10/10/2015
> {code}
> When I add date format options:
> {code}
> Dataset users = spark.read().format("csv").option("mode", 
> "PERMISSIVE").option("header", "true")
> .option("inferSchema", 
> "true").option("dateFormat", 
> "dd/MM/").load("src/main/resources/user.csv");
>   users.printSchema();
> {code}
> the expected schema should be 
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: date (nullable = true)
>  |-- ended: date (nullable = true)
> {code}
> but the actual result is: 
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: string (nullable = true)
>  |-- ended: string (nullable = true)
> {code}
> This means the date is processed as a string and the "dateFormat" option is ignored.
> If I add option 
> {code}
> .option("timestampFormat", "dd/MM/")
> {code}
> result is: 
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: timestamp (nullable = true)
>  |-- ended: timestamp (nullable = true)
> {code}
> I think the issue is somewhere in object CSVInferSchema, function 
> inferField, lines 80-97: a "tryParseDate" method needs to be added 
> before/after "tryParseTimestamp", or the date/timestamp processing logic 
> needs to be changed.
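The suggested fix — try a date parse before falling through to timestamp — can be sketched outside Spark (a hypothetical function; CSVInferSchema itself is Scala, and the format strings here are Python strptime equivalents of the DataSource options):

```python
from datetime import datetime

def infer_field(value, date_format="%d/%m/%Y",
                ts_format="%d/%m/%Y %H:%M:%S"):
    # Try the narrower date format first, so "12/12/2012" infers as a
    # date instead of falling back to string (or timestamp).
    for fmt, kind in ((date_format, "date"), (ts_format, "timestamp")):
        try:
            datetime.strptime(value, fmt)
            return kind
        except ValueError:
            pass
    return "string"
```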






[jira] [Created] (SPARK-24304) Scheduler changes for continuous processing shuffle support

2018-05-17 Thread Li Yuanjian (JIRA)
Li Yuanjian created SPARK-24304:
---

 Summary: Scheduler changes for continuous processing shuffle 
support
 Key: SPARK-24304
 URL: https://issues.apache.org/jira/browse/SPARK-24304
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Li Yuanjian


Both docs below include the change requirements for the scheduler; adding a 
new JIRA here to collect our comments.
https://docs.google.com/document/d/14cGJ75v9myznywtB35ytEqL9wHy9xfZRv06B6g2tUgI/edit#bookmark=id.tjcvevpyuv4x
https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit?disco=B4z058g






[jira] [Commented] (SPARK-24298) PCAModel Memory in Pipeline

2018-05-17 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478802#comment-16478802
 ] 

Marco Gaido commented on SPARK-24298:
-

Could you please provide a small program or a simple list of steps to 
reproduce the problem you are seeing? Thanks.

> PCAModel Memory in Pipeline
> ---
>
> Key: SPARK-24298
> URL: https://issues.apache.org/jira/browse/SPARK-24298
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Hayri Volkan Agun
>Priority: Major
>
> Java memory usage accumulates when using the PCAModel transformer in a 
> classification pipeline. Although it is a dimensionality-reduction method, 
> memory increases continuously during classification.






[jira] [Commented] (SPARK-24288) Enable preventing predicate pushdown

2018-05-17 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478671#comment-16478671
 ] 

Wenchen Fan commented on SPARK-24288:
-

You can take a look at `ResolveReferences#dedupRight`, which breaks 
AnalysisBarrier. My feeling is wherever we break AnalysisBarrier, we should 
also break optimize barrier, otherwise we can't resolve the plan.

> Enable preventing predicate pushdown
> 
>
> Key: SPARK-24288
> URL: https://issues.apache.org/jira/browse/SPARK-24288
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Tomasz Gawęda
>Priority: Major
> Attachments: SPARK-24288.simple.patch
>
>
> Issue discussed on Mailing List: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html]
> While working with JDBC datasource I saw that many "or" clauses with 
> non-equality operators causes huge performance degradation of SQL query 
> to database (DB2). For example: 
> val df = spark.read.format("jdbc").(other options to parallelize 
> load).load() 
> df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x 
>  > 100)").show() // in real application whose predicates were pushed 
> many many lines below, many ANDs and ORs 
> If I use cache() before where, there is no predicate pushdown of this 
> "where" clause. However, in production system caching many sources is a 
> waste of memory (especially is pipeline is long and I must do cache many 
> times).There are also few more workarounds, but it would be great if Spark 
> will support preventing predicate pushdown by user.
>  
> For example: df.withAnalysisBarrier().where(...) ?
>  
> Note, that this should not be a global configuration option. If I read 2 
> DataFrames, df1 and df2, I would like to specify that df1 should not have 
> some predicates pushed down, but some may be, but df2 should have all 
> predicates pushed down, even if target query joins df1 and df2. As far as I 
> understand Spark optimizer, if we use functions like `withAnalysisBarrier` 
> and put AnalysisBarrier explicitly in logical plan, then predicates won't be 
> pushed down on this particular DataFrames and PP will be still possible on 
> the second one.
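The barrier idea can be illustrated with a toy plan rewriter (illustrative node classes, not Spark's optimizer): a predicate being pushed toward the data source descends through the plan but stops when it meets a Barrier node, so it stays evaluated in the engine rather than in the pushed-down source query.

```python
class Scan:
    def __init__(self):
        self.pushed = []  # predicates the source (e.g. JDBC) will evaluate

class Barrier:
    def __init__(self, child):
        self.child = child

class Filter:
    def __init__(self, pred, child):
        self.pred = pred
        self.child = child

def push_filter(plan, pred):
    # Push pred down toward the Scan; a Barrier keeps it above the source.
    if isinstance(plan, Scan):
        plan.pushed.append(pred)   # reached the source: pushdown happened
        return plan
    if isinstance(plan, Barrier):
        return Filter(pred, plan)  # blocked: predicate stays in the engine
    plan.child = push_filter(plan.child, pred)
    return plan
```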






[jira] [Commented] (SPARK-23928) High-order function: shuffle(x) → array

2018-05-17 Thread H Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478619#comment-16478619
 ] 

H Lu commented on SPARK-23928:
--

Update: I am running unit tests and having a Spark contributor on the team 
review the code before I submit a pull request. Thanks.

> High-order function: shuffle(x) → array
> ---
>
> Key: SPARK-23928
> URL: https://issues.apache.org/jira/browse/SPARK-23928
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Generate a random permutation of the given array x.






[jira] [Updated] (SPARK-24065) Issue with the property IgnoreLeadingWhiteSpace

2018-05-17 Thread Varsha Chandrashekar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varsha Chandrashekar updated SPARK-24065:
-
Component/s: Spark Shell

> Issue with the property IgnoreLeadingWhiteSpace
> ---
>
> Key: SPARK-24065
> URL: https://issues.apache.org/jira/browse/SPARK-24065
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell
>Affects Versions: 2.2.0
>Reporter: Varsha Chandrashekar
>Priority: Major
> Attachments: lds1.txt, spark-shell-result.PNG
>
>
> "IgnoreLeadingWhiteSpace" property is not working properly for a corner case, 
> Consider the data below:
> ||"Col1"||"Col2"||"Col3"||
> |   "A"  |   "Mark"   |   "US"   |
> |   "B"   |   "Luke"   |   "UK"   |
> Each cell contains leading and trailing white spaces. When I upload the 
> dataset passing "ignoreTrailingWhiteSpace" as true, the trailing spaces are 
> trimmed, which is right. But when I pass "ignoreLeadingWhiteSpace" as true, 
> it does not trim the leading spaces.
> The scenario was tested/executed in spark-shell. Refer to the results below:
> case 1: scala> var 
> df=spark.read.format("com.databricks.spark.csv").option("delimiter",",").option("qualifier","\"").option("escape","\\").option("header","true").option("inferSchema","true").option("ignoreLeadingWhiteSpace",false).option("ignoreTrailingWhiteSpace",false).load("C:\\Users\\vachandrashekar
> \\Desktop\\lds1.txt")
>  df: org.apache.spark.sql.DataFrame = [col1: string, Col2: string ... 1 more 
> field]
> scala> df.show()
> +-------+----------+--------+
> |   Col1|      Col2|    Col3|
> +-------+----------+--------+
> |  "A"  |  "Mark"  |  "US"  |
> |  "B"  |  "Luke" |  "UK" |
> +-------+----------+--------+
> case 2: scala> var 
> df=spark.read.format("com.databricks.spark.csv").option("delimiter",",").option("qualifier","\"").option("escape","\\").option("header","true").option("inferSchema","true").option("ignoreLeadingWhiteSpace",true).option("ignoreTrailingWhiteSpace",false).load("C:\\Users\\vachandrashekar\\Desktop\\lds1.txt")
>  df: org.apache.spark.sql.DataFrame = [col1: string, Col2: string ... 1 more field]
> scala> df.show()
>  +----+----+----+
> |Col1|Col2|Col3|
> +----+----+----+
> |    A|Mark|US|
> |    B|   Luke|   UK|
> +----+----+----+
> case 3: scala> var 
> df=spark.read.format("com.databricks.spark.csv").option("delimiter",",").option("qualifier","\"").option("escape","\\").option("header","true").option("inferSchema","true").option("ignoreLeadingWhiteSpace",false).option("ignoreTrailingWhiteSpace",true).load("C:\\Users\\vachandrashekar\\Desktop\\lds1.txt")
>  df: org.apache.spark.sql.DataFrame = [col1: string, Col2: string ... 1 more field]
> scala> df.show()
>  +----+----+----+
> |col1|Col2|Col3|
> +----+----+----+
> |    "A"|   "Mark"|   "US"|
> |    "B"|   "Luke"|   "UK"|
> +----+----+----+
>  
> Analysis:
> Case 1: Works fine. With "ignoreLeadingWhiteSpace" and 
> "ignoreTrailingWhiteSpace" both false, the data is previewed exactly as in the file.
>  
> Case 2: Not working. With "ignoreLeadingWhiteSpace" as true and 
> "ignoreTrailingWhiteSpace" as false, trailing white spaces are trimmed and 
> leading white spaces are retained, which is the opposite of what was requested. 
> Leading white space is trimmed only for two columns of the first row, 
> excluding the first column of that row.
>  
> Case 3: Works fine. With "ignoreLeadingWhiteSpace" as false and 
> "ignoreTrailingWhiteSpace" as true, only trailing white spaces are 
> trimmed and leading white spaces are retained.
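The intended per-field semantics of the two options can be sketched outside Spark (a plain-Python illustration of the expected trimming behavior, not Spark's actual CSV parser; the sample values mirror the table above):

```python
def trim_field(value, ignore_leading=False, ignore_trailing=False):
    """Expected per-field effect of ignoreLeadingWhiteSpace /
    ignoreTrailingWhiteSpace: strip only the requested side(s)."""
    if ignore_leading:
        value = value.lstrip()
    if ignore_trailing:
        value = value.rstrip()
    return value

row = ['   "A"  ', '   "Mark"   ', '   "US"   ']

# ignoreLeadingWhiteSpace=true, ignoreTrailingWhiteSpace=false (case 2):
# leading spaces removed from every field, trailing spaces kept.
print([trim_field(v, ignore_leading=True) for v in row])
# → ['"A"  ', '"Mark"   ', '"US"   ']

# ignoreLeadingWhiteSpace=false, ignoreTrailingWhiteSpace=true (case 3):
# trailing spaces removed, leading spaces kept.
print([trim_field(v, ignore_trailing=True) for v in row])
# → ['   "A"', '   "Mark"', '   "US"']
```

Under these semantics, case 2 above should trim the leading spaces of every field in every row, not just two fields of the first row.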





[jira] [Commented] (SPARK-24002) Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes

2018-05-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478596#comment-16478596
 ] 

Apache Spark commented on SPARK-24002:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/21351
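For context, the failure surfaces in ClosureCleaner.ensureSerializable, which checks that a task's closure can be serialized on the driver before it is shipped to executors. The failure mode can be mimicked with a minimal Python sketch using pickle (illustrative only; the class and function names below are hypothetical stand-ins, not Spark code):

```python
import pickle
from functools import partial

class BufferBackedValue:
    """Hypothetical stand-in for an object whose serialization fails
    (analogous to a ByteBuffer-backed binary)."""
    def __reduce__(self):
        raise TypeError("underlying buffer cannot be serialized")

def ensure_serializable(task):
    # In the same spirit as ClosureCleaner.ensureSerializable: fail fast
    # on the driver instead of after shipping the task to an executor.
    try:
        pickle.dumps(task)
    except Exception as exc:
        raise RuntimeError("Task not serializable") from exc

def process(captured, row):
    return (captured, row)

# The task closure captures the non-serializable value, so the
# pre-flight serialization check fails.
task = partial(process, BufferBackedValue())
try:
    ensure_serializable(task)
except RuntimeError as exc:
    print(exc)  # → Task not serializable
```

The chained cause on the raised error plays the role of the `Caused by:` frames in the trace below, pointing at the object that could not be serialized.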

> Task not serializable caused by 
> org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes
> 
>
> Key: SPARK-24002
> URL: https://issues.apache.org/jira/browse/SPARK-24002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.4.0
>
>
> Two queries are involved: a 1000-line SQL query and a 3000-line SQL query. 
> It takes at least one hour of running with a heavy write workload to reproduce once. 
> {code}
> Py4JJavaError: An error occurred while calling o153.sql.
> : org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:189)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$59.apply(Dataset.scala:3021)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:127)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3020)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:646)
>   at sun.reflect.GeneratedMethodAccessor153.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
>   at py4j.Gateway.invoke(Gateway.java:293)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:226)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.spark.SparkException: Exception thrown in Future.get: 
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:190)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:267)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doConsume(BroadcastNestedLoopJoinExec.scala:530)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
>   at 
> org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:37)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:69)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
>   at 
> org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:144)
>   ...
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190)
>   ... 23 more
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.spark.SparkException: Task not serializable
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:179)
>   ... 276 more
> Caused by: org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)
>   at