[jira] [Updated] (SPARK-21703) Why RPC message are transferred with header and body separately in TCP frame
[ https://issues.apache.org/jira/browse/SPARK-21703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

neoremind updated SPARK-21703:
------------------------------
    Description:

After looking into the details of how Spark leverages Netty, I have one question. Typically an RPC message wire format has a header+payload structure, and Netty uses a TransportFrameDecoder to determine a complete message from the remote peer. But after sniffing with Wireshark, I found that each message is sent as a header followed separately by the body. Although this works fine, the underlying TCP connection sends back ACK segments to acknowledge each write, so there is a little redundancy: we could send them together, and the header is usually very small.

The main reason can be found in the MessageWithHeader class, whose transferTo method writes twice, once for the header and once for the body.

Could someone help me understand the background story on why it is implemented this way? Thanks!
> Why RPC message are transferred with header and body separately in TCP frame
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-21703
>                 URL: https://issues.apache.org/jira/browse/SPARK-21703
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: neoremind
>            Priority: Trivial
>              Labels: RPC
>
> After looking into the details of how Spark leverages Netty, I have one question. Typically an RPC message wire format has a header+payload structure, and Netty uses a TransportFrameDecoder to determine a complete message from the remote peer. But after sniffing with Wireshark, I found that each message is sent as a header followed separately by the body. Although this works fine, the underlying TCP connection sends back ACK segments to acknowledge each write, so there is a little redundancy: we could send them together, and the header is usually very small.
> The main reason can be found in the MessageWithHeader class, whose transferTo method writes twice, once for the header and once for the body.
> Could someone help me understand the background story on why it is implemented this way? Thanks!

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
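The two-write behavior the reporter observed can be sketched in miniature: a transferTo-style sender that issues one channel write for the header and a second for the body emits exactly the same bytes as a single coalesced write, just in two system calls (and, as Wireshark shows, often two TCP segments). This is a simplified, hypothetical stand-in, not Spark's actual MessageWithHeader code:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;

public class TwoWriteDemo {
    // Simplified stand-in for a MessageWithHeader-style transferTo:
    // the header and the body go out in two separate channel writes.
    static long transferTo(ByteBuffer header, ByteBuffer body,
                           WritableByteChannel ch) throws Exception {
        long written = 0;
        while (header.hasRemaining()) written += ch.write(header); // write 1
        while (body.hasRemaining())   written += ch.write(body);   // write 2
        return written;
    }

    public static void main(String[] args) throws Exception {
        ByteBuffer header = ByteBuffer.wrap(new byte[] {0, 0, 0, 4}); // toy frame length
        ByteBuffer body   = ByteBuffer.wrap("ping".getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long n = transferTo(header, body, Channels.newChannel(out));
        System.out.println(n);          // 8 bytes written in total
        System.out.println(out.size()); // 8: receiver sees one contiguous stream
    }
}
```

One plausible reason for keeping the two writes separate is that the body is often a zero-copy region (for example a file-backed buffer handed to the kernel directly), so merging it with the header would force copying the body into a fresh buffer; the cost of one extra small segment is usually cheaper than that copy, and TCP-level coalescing often absorbs it anyway.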
[jira] [Created] (SPARK-21703) Why RPC message are transferred with header and body separately in TCP frame
neoremind created SPARK-21703:
------------------------------
             Summary: Why RPC message are transferred with header and body separately in TCP frame
                 Key: SPARK-21703
                 URL: https://issues.apache.org/jira/browse/SPARK-21703
             Project: Spark
          Issue Type: Question
          Components: Spark Core
    Affects Versions: 1.6.0
            Reporter: neoremind
            Priority: Trivial

After looking into the details of how Spark leverages Netty, I have one question. Typically an RPC message wire format has a header+payload structure, and Netty uses a TransportFrameDecoder to determine a complete message from the remote peer. But after sniffing with Wireshark, I found that each message is sent as a header and then a body. Although this works fine, the underlying TCP connection sends back ACK segments to acknowledge each write, so there is a little redundancy: we could send them together, and the header is usually very small.

The main reason can be found in the MessageWithHeader class, whose transferTo method writes twice, once for the header and once for the body.

Could someone help me understand the background story on why it is implemented this way? Thanks!
[jira] [Created] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used
George Pongracz created SPARK-21702:
------------------------------------
             Summary: Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used
                 Key: SPARK-21702
                 URL: https://issues.apache.org/jira/browse/SPARK-21702
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 2.2.0
         Environment: Hadoop 2.7.3: AWS SDK 1.7.4
                      Hadoop 2.8.1: AWS SDK 1.10.6
            Reporter: George Pongracz
            Priority: Minor

Settings:
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "AES256")

When writing to an S3 sink from Structured Streaming, the files are encrypted using AES-256. When introducing a "partitionBy", the output data files are unencrypted; all other supporting files and metadata are encrypted.

Suspect the write to the temporary location is encrypted and the move/rename does not apply the SSE.
[jira] [Created] (SPARK-21701) Add TCP send/rcv buffer size support for RPC client
neoremind created SPARK-21701:
------------------------------
             Summary: Add TCP send/rcv buffer size support for RPC client
                 Key: SPARK-21701
                 URL: https://issues.apache.org/jira/browse/SPARK-21701
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 1.6.0
            Reporter: neoremind
            Priority: Trivial

In the TransportClientFactory class there are no parameters derived from SparkConf to set ChannelOption.SO_RCVBUF and ChannelOption.SO_SNDBUF for Netty. Increasing the receive buffer size can improve I/O performance for high-volume transports.
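Netty's ChannelOption.SO_RCVBUF and ChannelOption.SO_SNDBUF correspond to the standard TCP socket buffer options. A minimal plain-Java sketch of what such a SparkConf-derived knob would ultimately set on each connection (the 4 MiB values are illustrative only; the OS may round or clamp them):

```java
import java.net.Socket;

public class SocketBufferDemo {
    public static void main(String[] args) throws Exception {
        Socket s = new Socket(); // not yet connected; options apply at connect time
        // Hypothetical sizes a SparkConf-derived setting might supply.
        // The kernel may adjust them (Linux, for example, doubles SO_RCVBUF).
        s.setReceiveBufferSize(4 * 1024 * 1024); // SO_RCVBUF
        s.setSendBufferSize(4 * 1024 * 1024);    // SO_SNDBUF
        System.out.println("rcv=" + s.getReceiveBufferSize());
        System.out.println("snd=" + s.getSendBufferSize());
        s.close();
    }
}
```

Larger receive buffers mainly help on high-bandwidth or high-latency links, where the TCP window would otherwise cap throughput below what the network can carry.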
[jira] [Commented] (SPARK-21700) How can I get the MetricsSystem information
[ https://issues.apache.org/jira/browse/SPARK-21700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122762#comment-16122762 ]

Alex Bozarth commented on SPARK-21700:
--------------------------------------
I would recommend taking a look at the Metrics REST API (https://spark.apache.org/docs/latest/monitoring.html#rest-api). Also, in the future the email list would probably be a better place for questions like these.

> How can I get the MetricsSystem information
> -------------------------------------------
>
>                 Key: SPARK-21700
>                 URL: https://issues.apache.org/jira/browse/SPARK-21700
>             Project: Spark
>          Issue Type: Question
>          Components: Web UI
>    Affects Versions: 1.6.0
>            Reporter: LiuXiangyu
>
> I want to get the information that is shown on the Spark Web UI, but I don't want to write a spider to scrape it from the website. I know that information comes from the MetricsSystem; is there any way I can use the MetricsSystem in my program to get those metrics?
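For reference, the monitoring REST API mentioned in the comment is served under /api/v1 on the driver's web UI (port 4040 by default) or on the history server. A small sketch that builds the executor-summary endpoint for an application; the host/port and application id here are placeholders, not real values:

```java
public class MetricsUrl {
    // Builds the monitoring REST endpoint that returns executor summaries,
    // i.e. /api/v1/applications/[app-id]/executors on the driver UI.
    static String executorsEndpoint(String uiHostPort, String appId) {
        return "http://" + uiHostPort + "/api/v1/applications/" + appId + "/executors";
    }

    public static void main(String[] args) {
        // "localhost:4040" and the app id are placeholder examples.
        System.out.println(executorsEndpoint("localhost:4040", "app-20170810-0001"));
        // → http://localhost:4040/api/v1/applications/app-20170810-0001/executors
    }
}
```

Fetching that URL with any HTTP client returns JSON, which avoids both scraping the UI and wiring into the internal MetricsSystem classes.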
[jira] [Updated] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
[ https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-21693:
---------------------------------
    Description:

We sometimes reach the time limit, 1.5 hours:
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master

I requested to increase this from an hour to 1.5 hours before, but it looks like we should fix this in Spark. I have asked this for my account a few times before, and it looks like we can't increase this time limit again and again.

I could identify two things that look like they take quite a bit of time:

1. The cache feature is disabled in the pull request builder, which ends up downloading Maven dependencies (roughly 10ish mins): https://www.appveyor.com/docs/build-cache/

{quote}
Note: Saving cache is disabled in Pull Request builds.
{quote}

See also http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working. This seems difficult to fix within Spark.

2. The "MLlib classification algorithms" tests (30-35ish mins). The test below looks like it takes 30-35ish mins:

{code}
MLlib classification algorithms, except for tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\..
..
{code}

As a (I think) last resort, we could make a matrix for this test alone, so that we run the other tests after one build and then run this test after another build. For example, I run Scala tests by this workaround: https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix with 7 builds, each with its own test run). I am also checking and testing other ways.

> AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
> -------------------------------------------------------------------------
>
>                 Key: SPARK-21693
>                 URL: https://issues.apache.org/jira/browse/SPARK-21693
>             Project: Spark
>          Issue Type: Test
>          Components: Build, SparkR
>    Affects Versions: 2.3.0
>            Reporter: Hyukjin Kwon
>
> We sometimes reach the time limit, 1.5 hours:
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
> I requested to increase this from an hour to 1.5 hours before, but it looks like we should fix this in Spark. I have asked this for my account a few times before, and it looks like we can't increase this time limit again and again.
> I could identify two things that look like they take quite a bit of time:
> 1. The cache feature is disabled in the pull request builder, which ends up downloading Maven dependencies (roughly 10ish mins): https://www.appveyor.com/docs/build-cache/
> {quote}
> Note: Saving cache is disabled in Pull Request builds.
> {quote}
> See also http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working. This seems difficult to fix within Spark.
> 2. The "MLlib classification algorithms" tests (30-35ish mins). The test below looks like it takes 30-35ish mins:
> {code}
> MLlib classification algorithms, except for tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\..
> ..
> {code}
> As a (I think) last resort, we could make a matrix for this test alone, so that we run the other tests after one build and then run this test after another build. For example, I run Scala tests by this workaround:
[jira] [Created] (SPARK-21700) How can I get the MetricsSystem information
LiuXiangyu created SPARK-21700:
-------------------------------
             Summary: How can I get the MetricsSystem information
                 Key: SPARK-21700
                 URL: https://issues.apache.org/jira/browse/SPARK-21700
             Project: Spark
          Issue Type: Question
          Components: Web UI
    Affects Versions: 1.6.0
            Reporter: LiuXiangyu

I want to get the information that is shown on the Spark Web UI, but I don't want to write a spider to scrape it from the website. I know that information comes from the MetricsSystem; is there any way I can use the MetricsSystem in my program to get those metrics?
[jira] [Updated] (SPARK-21699) Remove unused getTableOption in ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-21699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-21699:
--------------------------------
    Fix Version/s: 2.2.1

> Remove unused getTableOption in ExternalCatalog
> -----------------------------------------------
>
>                 Key: SPARK-21699
>                 URL: https://issues.apache.org/jira/browse/SPARK-21699
>             Project: Spark
>          Issue Type: Task
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>             Fix For: 2.2.1, 2.3.0
[jira] [Resolved] (SPARK-21699) Remove unused getTableOption in ExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-21699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-21699.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0

> Remove unused getTableOption in ExternalCatalog
> -----------------------------------------------
>
>                 Key: SPARK-21699
>                 URL: https://issues.apache.org/jira/browse/SPARK-21699
>             Project: Spark
>          Issue Type: Task
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>             Fix For: 2.3.0
[jira] [Commented] (SPARK-21564) TaskDescription decoding failure should fail the task
[ https://issues.apache.org/jira/browse/SPARK-21564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122651#comment-16122651 ]

Andrew Ash commented on SPARK-21564:
------------------------------------
[~irashid] a possible fix could look roughly like this:

{noformat}
diff --git a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
index a2f1aa22b0..06d72fe106 100644
--- a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
+++ b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
@@ -17,6 +17,7 @@
 package org.apache.spark.executor

+import java.io.{DataInputStream, NotSerializableException}
 import java.net.URL
 import java.nio.ByteBuffer
 import java.util.Locale
@@ -35,7 +36,7 @@
 import org.apache.spark.rpc._
 import org.apache.spark.scheduler.{ExecutorLossReason, TaskDescription}
 import org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages._
 import org.apache.spark.serializer.SerializerInstance
-import org.apache.spark.util.{ThreadUtils, Utils}
+import org.apache.spark.util.{ByteBufferInputStream, ThreadUtils, Utils}

 private[spark] class CoarseGrainedExecutorBackend(
     override val rpcEnv: RpcEnv,
@@ -93,9 +94,26 @@
       if (executor == null) {
         exitExecutor(1, "Received LaunchTask command but executor was null")
       } else {
-        val taskDesc = TaskDescription.decode(data.value)
-        logInfo("Got assigned task " + taskDesc.taskId)
-        executor.launchTask(this, taskDesc)
+        try {
+          val taskDesc = TaskDescription.decode(data.value)
+          logInfo("Got assigned task " + taskDesc.taskId)
+          executor.launchTask(this, taskDesc)
+        } catch {
+          case e: Exception =>
+            val taskId = new DataInputStream(new ByteBufferInputStream(
+              ByteBuffer.wrap(data.value.array()))).readLong()
+            val ser = env.closureSerializer.newInstance()
+            val serializedTaskEndReason = {
+              try {
+                ser.serialize(new ExceptionFailure(e, Nil))
+              } catch {
+                case _: NotSerializableException =>
+                  // e is not serializable so just send the stacktrace
+                  ser.serialize(new ExceptionFailure(e, Nil, false))
+              }
+            }
+            statusUpdate(taskId, TaskState.FAILED, serializedTaskEndReason)
+        }
       }

     case KillTask(taskId, _, interruptThread, reason) =>
{noformat}

The downside here though is that we're still making the assumption that the TaskDescription is well-formatted enough to be able to get the taskId out of it (the first long in the serialized bytes). Any other thoughts on how to do this?

> TaskDescription decoding failure should fail the task
> -----------------------------------------------------
>
>                 Key: SPARK-21564
>                 URL: https://issues.apache.org/jira/browse/SPARK-21564
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Andrew Ash
>
> cc [~robert3005]
> I was seeing an issue where Spark was throwing this exception:
> {noformat}
> 16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox - Ignoring error
> java.io.EOFException: null
>         at java.io.DataInputStream.readFully(DataInputStream.java:197)
>         at java.io.DataInputStream.readUTF(DataInputStream.java:609)
>         at java.io.DataInputStream.readUTF(DataInputStream.java:564)
>         at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
>         at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
>         at scala.collection.immutable.Range.foreach(Range.scala:160)
>         at org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
>         at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
>         at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> For details on the cause of that exception, see SPARK-21563
> We've since changed the application and have a proposed fix in Spark at the ticket above, but it was troubling that decoding the
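The fallback in the proposed diff leans on one assumption: the taskId is the first field TaskDescription.encode writes, so it can still be read even from an otherwise corrupt buffer. A toy illustration of that recovery step (a hypothetical encoding, not Spark's real wire format):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

public class TaskIdRecovery {
    public static void main(String[] args) throws Exception {
        // Encode a toy "TaskDescription": taskId first, then the rest of
        // the message (pretend everything after the long is corrupt).
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeLong(42L);           // taskId: first field on the wire
        out.writeUTF("partial payl"); // remainder, assumed undecodable

        // Mirror of the fallback: read only the leading long and ignore
        // the rest, so the failure can be reported against the right task.
        long taskId = new DataInputStream(
            new ByteArrayInputStream(bytes.toByteArray())).readLong();
        System.out.println(taskId); // 42
    }
}
```

If the corruption ever reached those first eight bytes too, this recovery would report the failure against the wrong task, which is exactly the residual concern raised in the comment.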
[jira] [Commented] (SPARK-21563) Race condition when serializing TaskDescriptions and adding jars
[ https://issues.apache.org/jira/browse/SPARK-21563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122549#comment-16122549 ]

Andrew Ash commented on SPARK-21563:
------------------------------------
Thanks for the thoughts [~irashid] -- I submitted a PR implementing this approach at https://github.com/apache/spark/pull/18913

> Race condition when serializing TaskDescriptions and adding jars
> ----------------------------------------------------------------
>
>                 Key: SPARK-21563
>                 URL: https://issues.apache.org/jira/browse/SPARK-21563
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Andrew Ash
>
> cc [~robert3005]
> I was seeing this exception during some running Spark jobs:
> {noformat}
> 16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox - Ignoring error
> java.io.EOFException: null
>         at java.io.DataInputStream.readFully(DataInputStream.java:197)
>         at java.io.DataInputStream.readUTF(DataInputStream.java:609)
>         at java.io.DataInputStream.readUTF(DataInputStream.java:564)
>         at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
>         at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
>         at scala.collection.immutable.Range.foreach(Range.scala:160)
>         at org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
>         at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
>         at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
>         at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
>         at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> After some debugging, we determined that this is due to a race condition in task serde. cc [~irashid] [~kayousterhout] who last touched that code in SPARK-19796
> The race is between adding additional jars to the SparkContext and serializing the TaskDescription.
> Consider this sequence of events:
> - TaskSetManager creates a TaskDescription using a reference to the SparkContext's jars: https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L506
> - TaskDescription starts serializing, and begins writing jars: https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L84
> - the size of the jar map is written out: https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L63
> - _on another thread_: the application adds a jar to the SparkContext's jars list
> - then the entries in the jars list are serialized out: https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L64
> The problem now is that the jars list is serialized as having N entries, but actually N+1 entries follow that count!
> This causes task deserialization to fail in the executor, with the stacktrace above.
> The same issue also likely exists for files, though I haven't observed that and our application does not stress that codepath the same way it did for jar additions.
> One fix here is that TaskSetManager could make an immutable copy of the jars list that it passes into the TaskDescription constructor, so that list doesn't change mid-serialization.
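The fix suggested at the end, snapshotting the jars collection before handing it to the TaskDescription, removes the race because serialization then iterates a collection no other thread can mutate, so the size prefix always matches the entries that follow. A minimal Java sketch of the idea (Spark's code is Scala; the names here are purely illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class SnapshotDemo {
    public static void main(String[] args) {
        // Stand-in for the SparkContext's mutable jars map.
        Map<String, Long> jars = new HashMap<>();
        jars.put("a.jar", 1L);

        // Defensive, immutable copy taken before "serialization" begins,
        // analogous to what TaskSetManager could pass to TaskDescription.
        Map<String, Long> snapshot = Map.copyOf(jars);

        // Another thread adds a jar mid-flight; the snapshot is unaffected,
        // so a size prefix written from it always matches its entry count.
        jars.put("b.jar", 2L);

        System.out.println(jars.size());     // 2: the live map changed
        System.out.println(snapshot.size()); // 1: the snapshot did not
    }
}
```

The same defensive-copy discipline would cover the files list too, which the report notes likely has the identical hazard.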
[jira] [Created] (SPARK-21699) Remove unused getTableOption in ExternalCatalog
Reynold Xin created SPARK-21699:
--------------------------------
             Summary: Remove unused getTableOption in ExternalCatalog
                 Key: SPARK-21699
                 URL: https://issues.apache.org/jira/browse/SPARK-21699
             Project: Spark
          Issue Type: Task
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Reynold Xin
            Assignee: Reynold Xin
[jira] [Updated] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
[ https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-21693: - Description: We sometimes reach the time limit, 1.5 hours: https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master I requested to increase this from an hour to 1.5 hours before, but it looks like we should fix this in Spark. I have asked for this on my account a few times before, and it seems we can't increase the time limit again and again. I could identify two things that look like they take quite a bit of time: 1. The cache feature is disabled in the pull request builder, which ends up downloading Maven dependencies (10-20 mins): https://www.appveyor.com/docs/build-cache/ {quote} Note: Saving cache is disabled in Pull Request builds. {quote} See also http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working This seems difficult to fix within Spark. 2. The "MLlib classification algorithms" tests (30-35 mins). The test below looks like it takes 30-35 mins: {code} MLlib classification algorithms, except for tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. .. {code} As a (I think) last resort, we could make a matrix for this test alone, so that we run the other tests after one build and then run this test after another build; for example, I run Scala tests with this workaround - https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix with 7 builds, each building and testing). I am also checking and testing other ways. was: We sometimes reach the time limit, 1.5 hours: https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master I requested to increase this from an hour to 1.5 hours before, but it looks like we should fix this in AppVeyor. I have asked for this on my account a few times before, and it seems we can't increase the time limit again and again. I could identify two things that look like they take quite a bit of time: 1. The cache feature is disabled in the pull request builder, which ends up downloading Maven dependencies (10-20 mins): https://www.appveyor.com/docs/build-cache/ {quote} Note: Saving cache is disabled in Pull Request builds. {quote} See also http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working This seems difficult to fix within Spark. 2. The "MLlib classification algorithms" tests (30-35 mins). The test below looks like it takes 30-35 mins: {code} MLlib classification algorithms, except for tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. .. {code} As a (I think) last resort, we could make a matrix for this test alone, so that we run the other tests after one build and then run this test after another build; for example, I run Scala tests with this workaround - https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix with 7 builds, each building and testing). I am also checking and testing other ways.
> AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests > - > > Key: SPARK-21693 > URL: https://issues.apache.org/jira/browse/SPARK-21693 > Project: Spark > Issue Type: Test > Components: Build, SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon > > We sometimes reach the time limit, 1.5 hours: > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master > I requested to increase this from an hour to 1.5 hours before, but it looks > like we should fix this in Spark. I have asked for this on my account a few > times before, and it seems we can't increase the time limit again and again. > I could identify two things that look like they take quite a bit of time: > 1. The cache feature is disabled in the pull request builder, which ends up > downloading Maven dependencies (10-20 mins): > https://www.appveyor.com/docs/build-cache/ > {quote} > Note: Saving cache is disabled in Pull Request builds.
> {quote} > See also > http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working > This seems difficult to fix within Spark. > 2. The "MLlib classification algorithms" tests (30-35 mins). The test below > looks like it takes 30-35 mins: > {code} > MLlib classification algorithms, except for tree-based algorithms: Spark > package found in SPARK_HOME: C:\projects\spark\bin\.. > .. > {code} > As a (I think) last resort, we could make a matrix for this test alone, so > that we run the other tests after one build and then run this test after > another build; for example, I run Scala tests with this workaround - >
[jira] [Commented] (SPARK-21685) Params isSet in scala Transformer triggered by _setDefault in pyspark
[ https://issues.apache.org/jira/browse/SPARK-21685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122492#comment-16122492 ] Joseph K. Bradley commented on SPARK-21685: --- Could you please point to more info, such as the Python wrappers you are calling? I don't see enough info here to identify the problem. > Params isSet in scala Transformer triggered by _setDefault in pyspark > - > > Key: SPARK-21685 > URL: https://issues.apache.org/jira/browse/SPARK-21685 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Ratan Rai Sur > > I'm trying to write a PySpark wrapper for a Transformer whose transform > method includes the line > {code:java} > require(!(isSet(outputNodeName) && isSet(outputNodeIndex)), "Can't set both > outputNodeName and outputNodeIndex") > {code} > This should only throw an exception when both of these parameters are > explicitly set. > In the PySpark wrapper for the Transformer, there is this line in __init__: > {code:java} > self._setDefault(outputNodeIndex=0) > {code} > Here is the line in the main python script showing how it is being configured: > {code:java} > cntkModel = > CNTKModel().setInputCol("images").setOutputCol("output").setModelLocation(spark, > model.uri).setOutputNodeName("z") > {code} > As you can see, only setOutputNodeName is being explicitly set, but the > exception is still being thrown. > If you need more context, > https://github.com/RatanRSur/mmlspark/tree/default-cntkmodel-output is the > branch with the code; the tracked files I'm referring to here are the > following: > src/cntk-model/src/main/scala/CNTKModel.scala > notebooks/tests/301 - CIFAR10 CNTK CNN Evaluation.ipynb > The pyspark wrapper code is autogenerated. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
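For context on the semantics the guard above relies on, here is a minimal Python sketch — a simplified model, not the real pyspark.ml.param.Params implementation — of how defaults registered via _setDefault are supposed to differ from explicitly set params: a default should make the value retrievable without making isSet true, so a Scala-side require(!(isSet(a) && isSet(b))) only fires when the user sets both. The bug report suggests the wrapper's default is instead arriving on the JVM side as an explicit set. The param names are taken from the report; the class itself is hypothetical.

```python
# Simplified model of the set-vs-default distinction (not the real
# pyspark.ml.param.Params API): defaults and explicit values live in
# separate maps, and isSet() only consults the explicit one.

class Params:
    def __init__(self):
        self._defaultParamMap = {}   # values registered via _setDefault
        self._paramMap = {}          # values explicitly set by the user

    def _setDefault(self, **kwargs):
        self._defaultParamMap.update(kwargs)

    def set(self, name, value):
        self._paramMap[name] = value

    def isSet(self, name):
        # Only user-set params count; defaults must not trip this.
        return name in self._paramMap

    def getOrDefault(self, name):
        return self._paramMap.get(name, self._defaultParamMap.get(name))


m = Params()
m._setDefault(outputNodeIndex=0)   # default only, as in the wrapper's __init__
m.set("outputNodeName", "z")       # explicit, as in the user's script

# Under these semantics the require(...) guard would not fire:
assert m.isSet("outputNodeName")
assert not m.isSet("outputNodeIndex")
```

If the Python wrapper instead forwards its default to the JVM with the same call path as an explicit setter, `isSet(outputNodeIndex)` becomes true on the Scala side and the exception described in the report is thrown.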
[jira] [Updated] (SPARK-21698) write.partitionBy() is giving me garbage data
[ https://issues.apache.org/jira/browse/SPARK-21698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luis updated SPARK-21698: - Description: Spark partitionBy is causing some data corruption. I am doing three simple writes. Below is the code to reproduce the problem.
{code:title=Program Output|borderStyle=solid}
17/08/10 16:05:03 WARN [SparkUtils]: [Database exists] test
/usr/local/spark/python/pyspark/sql/session.py:331: UserWarning: inferring schema from dict is deprecated, please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"
+---+----+-----+
| id|name|count|
+---+----+-----+
|  1|   1|    1|
|  2|   2|    2|
|  3|   3|    3|
+---+----+-----+
17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
+---+----+-----+
| id|name|count|
+---+----+-----+
|  1|   1|    1|
|  2|   2|    2|
|  3|   3|    3|
|  4|   4|    4|
|  5|   5|    5|
|  6|   6|    6|
+---+----+-----+
+---+----+-----+
| id|name|count|
+---+----+-----+
|  9|   4| null|
| 10|   6| null|
|  7|   1| null|
|  8|   2| null|
|  1|   1|    1|
|  2|   2|    2|
|  3|   3|    3|
|  4|   4|    4|
|  5|   5|    5|
|  6|   6|    6|
+---+----+-----+
{code}
In the last show(), I see the data is null.
{code:title=spark init|borderStyle=solid}
self.spark = SparkSession \
    .builder \
    .master("spark://localhost:7077") \
    .enableHiveSupport() \
    .getOrCreate()
{code}
{code:title=Code for the test case|borderStyle=solid}
def test_clean_insert_table(self):
    table_name = "data"
    data0 = [
        {"id": 1, "name": "1", "count": 1},
        {"id": 2, "name": "2", "count": 2},
        {"id": 3, "name": "3", "count": 3},
    ]
    df_data0 = self.spark.createDataFrame(data0)
    df_data0.write.partitionBy("count").mode("overwrite").saveAsTable(table_name)
    df_return = self.spark.read.table(table_name)
    df_return.show()

    data1 = [
        {"id": 4, "name": "4", "count": 4},
        {"id": 5, "name": "5", "count": 5},
        {"id": 6, "name": "6", "count": 6},
    ]
    df_data1 = self.spark.createDataFrame(data1)
    df_data1.write.insertInto(table_name)
    df_return = self.spark.read.table(table_name)
    df_return.show()

    data3 = [
        {"id": 1, "name": "one", "count": 7},
        {"id": 2, "name": "two", "count": 8},
        {"id": 4, "name": "three", "count": 9},
        {"id": 6, "name": "six", "count": 10},
    ]
    df_data3 = self.spark.createDataFrame(data3)
    df_data3.write.insertInto(table_name)
    df_return = self.spark.read.table(table_name)
    df_return.show()
{code}
was: Spark partitionBy is causing some data corruption. I am doing three simple writes. Below is the code to reproduce the problem.
{code:title=Program Output|borderStyle=solid} 17/08/10 16:05:03 WARN [SparkUtils]: [Database exists] test /usr/local/spark/python/pyspark/sql/session.py:331: UserWarning: inferring schema from dict is deprecated, please use pyspark.sql.Row instead warnings.warn("inferring schema from dict is deprecated," +---++-+ | id|name|count| +---++-+ | 1| 1|1| | 2| 2|2| | 3| 3|3| +---++-+ 17/08/10 16:05:07 WARN log: Updating partition stats fast for: data 17/08/10 16:05:07 WARN log: Updated size to 545 17/08/10 16:05:07 WARN log: Updating partition stats fast for: data 17/08/10 16:05:07 WARN log: Updated size to 545 17/08/10 16:05:07 WARN log: Updating partition stats fast for: data 17/08/10 16:05:07 WARN log: Updated size to 545 +---++-+ | id|name|count| +---++-+ | 1| 1|1| | 2| 2|2| | 3| 3|3| | 4| 4|4| | 5| 5|5| | 6| 6|6| +---++-+ +---++-+ | id|name|count| +---++-+ | 1| 1|1| | 2| 2|2| | 3| 3|3| | 4| 4|4| | 4| 4|4| | 5| 5|5| | 5| 5|5| | 6| 6|6| | 6| 6|6| +---++-+ {code} In the last show(), I see the data isn't what I would expect. {code:title=spark init|borderStyle=solid} self.spark = SparkSession \ .builder \ .master("spark://localhost:7077") \ .enableHiveSupport() \ .getOrCreate() {code} {code:title=Code for the test case|borderStyle=solid} def test_clean_insert_table(self): table_name = "data" data0 = [ {"id": 1,
[jira] [Updated] (SPARK-21698) write.partitionBy() is giving me garbage data
[ https://issues.apache.org/jira/browse/SPARK-21698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luis updated SPARK-21698: - Description: Spark partitionBy is causing some data corruption. I am doing three simple writes. Below is the code to reproduce the problem. h4. Program output: {code:title=Program Output|borderStyle=solid} 17/08/10 15:30:41 WARN [SparkUtils]: [Creating/Writing Table] data [TestSparkUtils]: DEBUG: [ Initial Create --] +-+---++ |count| id|name| +-+---++ |1| 1| 1| |2| 2| 2| |3| 3| 3| +-+---++ [TestSparkUtils]: DEBUG: [ Insert No Duplicates --] 17/08/10 15:30:43 WARN [SparkUtils]: [Inserting Into Table] data 17/08/10 15:30:44 WARN log: Updating partition stats fast for: data 17/08/10 15:30:44 WARN log: Updated size to 545 17/08/10 15:30:44 WARN log: Updating partition stats fast for: data 17/08/10 15:30:44 WARN log: Updated size to 545 17/08/10 15:30:44 WARN log: Updating partition stats fast for: data 17/08/10 15:30:44 WARN log: Updated size to 545 +---++-+ | id|name|count| +---++-+ | 1| 1|1| | 2| 2|2| | 3| 3|3| | 4| 4|4| | 5| 5|5| | 6| 6|6| +---++-+ +-+---+-+ |count| id| name| +-+---+-+ |7| 1| one| |8| 2| two| |9| 4|three| | 10| 6| six| +-+---+-+ [TestSparkUtils]: DEBUG: [ Update --] 17/08/10 15:30:44 WARN [SparkUtils]: [Inserting Into Table] data 17/08/10 15:30:45 WARN log: Updating partition stats fast for: data 17/08/10 15:30:45 WARN log: Updated size to 1122 +---++-+ | id|name|count| +---++-+ | 9| 4| null| | 10| 6| null| | 7| 1| null| | 8| 2| null| | 1| 1|1| | 2| 2|2| | 3| 3|3| | 4| 4|4| | 5| 5|5| | 6| 6|6| +---++-+ .. -- Ran 2 tests in 11.559s OK {code} In the last show(), I see the data is corrupted: the data was switched between the columns, and I am getting null results. Below are the main parts of the code I use to generate the problem: {code:title=spark init|borderStyle=solid}
[jira] [Updated] (SPARK-21698) write.partitionBy() is giving me garbage data
[ https://issues.apache.org/jira/browse/SPARK-21698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luis updated SPARK-21698: - Summary: write.partitionBy() is giving me garbage data (was: write.partitionBy() is given me garbage data) > write.partitionBy() is giving me garbage data > - > > Key: SPARK-21698 > URL: https://issues.apache.org/jira/browse/SPARK-21698 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1, 2.2.0 > Environment: Linux Ubuntu 17.04. Python 3.5. >Reporter: Luis > > Spark partitionBy is causing some data corruption. I am doing three > simple writes. Below is the code to reproduce the problem. > h4. I'll start by showing the program output: > {code:title=Program Output|borderStyle=solid} > 17/08/10 15:30:41 WARN [SparkUtils]: [Creating/Writing Table] data > [TestSparkUtils]: DEBUG: [ Initial Create --] > +-+---++ > |count| id|name| > +-+---++ > |1| 1| 1| > |2| 2| 2| > |3| 3| 3| > +-+---++ > [TestSparkUtils]: DEBUG: [ Insert No Duplicates --] > 17/08/10 15:30:43 WARN [SparkUtils]: [Inserting Into Table] data > 17/08/10 15:30:44 WARN log: Updating partition stats fast for: data > 17/08/10 15:30:44 WARN log: Updated size to 545 > 17/08/10 15:30:44 WARN log: Updating partition stats fast for: data > 17/08/10 15:30:44 WARN log: Updated size to 545 > 17/08/10 15:30:44 WARN log: Updating partition stats fast for: data > 17/08/10 15:30:44 WARN log: Updated size to 545 > +---++-+ > | id|name|count| > +---++-+ > | 1| 1|1| > | 2| 2|2| > | 3| 3|3| > | 4| 4|4| > | 5| 5|5| > | 6| 6|6| > +---++-+ > +-+---+-+ > |count| id| name| > +-+---+-+ > |7| 1| one| > |8| 2| two| > |9| 4|three| > | 10| 6| six| > +-+---+-+ > [TestSparkUtils]: DEBUG: [ Update --] > 17/08/10 15:30:44 WARN [SparkUtils]: [Inserting Into Table] data > 17/08/10 15:30:45 WARN
[jira] [Created] (SPARK-21698) write.partitionBy() is given me garbage data
Luis created SPARK-21698: Summary: write.partitionBy() is given me garbage data Key: SPARK-21698 URL: https://issues.apache.org/jira/browse/SPARK-21698 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0, 2.1.1 Environment: Linux Ubuntu 17.04. Python 3.5. Reporter: Luis Spark partitionBy is causing some data corruption. I am doing three simple writes. Below is the code to reproduce the problem. h4. I'll start by showing the program output: {code:title=Program Output|borderStyle=solid} 17/08/10 15:30:41 WARN [SparkUtils]: [Creating/Writing Table] data [TestSparkUtils]: DEBUG: [ Initial Create --] +-+---++ |count| id|name| +-+---++ |1| 1| 1| |2| 2| 2| |3| 3| 3| +-+---++ [TestSparkUtils]: DEBUG: [ Insert No Duplicates --] 17/08/10 15:30:43 WARN [SparkUtils]: [Inserting Into Table] data 17/08/10 15:30:44 WARN log: Updating partition stats fast for: data 17/08/10 15:30:44 WARN log: Updated size to 545 17/08/10 15:30:44 WARN log: Updating partition stats fast for: data 17/08/10 15:30:44 WARN log: Updated size to 545 17/08/10 15:30:44 WARN log: Updating partition stats fast for: data 17/08/10 15:30:44 WARN log: Updated size to 545 +---++-+ | id|name|count| +---++-+ | 1| 1|1| | 2| 2|2| | 3| 3|3| | 4| 4|4| | 5| 5|5| | 6| 6|6| +---++-+ +-+---+-+ |count| id| name| +-+---+-+ |7| 1| one| |8| 2| two| |9| 4|three| | 10| 6| six| +-+---+-+ [TestSparkUtils]: DEBUG: [ Update --] 17/08/10 15:30:44 WARN [SparkUtils]: [Inserting Into Table] data 17/08/10 15:30:45 WARN log: Updating partition stats fast for: data 17/08/10 15:30:45 WARN log: Updated size to 1122 +---++-+ | id|name|count| +---++-+ | 9| 4| null| | 10| 6| null| | 7| 1| null| | 8| 2| null| | 1| 1|1| | 2| 2|2| | 3| 3|3| | 4| 4|4| | 5| 5|5| | 6| 6|6| +---++-+ .. -- Ran 2 tests in
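One plausible mechanism for the reports above — an assumption on my part, not a confirmed diagnosis — can be sketched in plain Python: createDataFrame on a list of dicts orders columns alphabetically (count, id, name, visible in the intermediate show() output above), saveAsTable with partitionBy("count") stores the partition column last (id, name, count), and insertInto resolves columns by position rather than by name, so later inserts land in the wrong columns. The column and row values come from the report; the simulation itself is hypothetical and does not use Spark.

```python
# Hypothetical simulation of position-based insertInto against a table
# whose partition column was moved to the end by partitionBy().

table_schema = ["id", "name", "count"]    # partition column "count" last on disk

def insert_into_by_position(rows, df_columns):
    """Simulate position-based column resolution: the i-th DataFrame
    column feeds the i-th table column, names are ignored."""
    out = []
    for row in rows:
        values = [row[c] for c in df_columns]
        out.append(dict(zip(table_schema, values)))
    return out

# A DataFrame built from dicts has alphabetically ordered columns:
df_columns = ["count", "id", "name"]
rows = [{"id": 1, "name": "one", "count": 7}]

inserted = insert_into_by_position(rows, df_columns)
# "count" lands in "id", "id" in "name", and "name" in "count":
assert inserted[0] == {"id": 7, "name": 1, "count": "one"}
```

Under this assumption, the nulls in the final show() would come from the string values (e.g. "one") landing in the partition column and failing to read back as its numeric type — matching the id/name swap visible in the last table of the report.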
[jira] [Resolved] (SPARK-21638) Warning message of RF is not accurate
[ https://issues.apache.org/jira/browse/SPARK-21638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21638. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18868 [https://github.com/apache/spark/pull/18868] > Warning message of RF is not accurate > - > > Key: SPARK-21638 > URL: https://issues.apache.org/jira/browse/SPARK-21638 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 > Environment: >Reporter: Peng Meng >Priority: Minor > Fix For: 2.3.0 > > > When train RF model, there is many warning message like this: > {quote}WARN RandomForest: Tree learning is using approximately 268492800 > bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. > This allows splitting 2622 nodes in this iteration.{quote} > This warning message is unnecessary and the data is not accurate. > Actually, if all the nodes cannot split in one iteration, it will show this > warning. For most of the case, all the nodes cannot split just in one > iteration, so for most of the case, it will show this warning for each > iteration. > This is because: > {code:java} > while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) { > val (treeIndex, node) = nodeStack.top > // Choose subset of features for node (if subsampling). > val featureSubset: Option[Array[Int]] = if > (metadata.subsamplingFeatures) { > Some(SamplingUtils.reservoirSampleAndCount(Range(0, > metadata.numFeatures).iterator, metadata.numFeaturesPerNode, > rng.nextLong())._1) > } else { > None > } > // Check if enough memory remains to add this node to the group. 
> val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, > featureSubset) * 8L > if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) { > nodeStack.pop() > mutableNodesForGroup.getOrElseUpdate(treeIndex, new > mutable.ArrayBuffer[LearningNode]()) += > node > mutableTreeToNodeToIndexInfo > .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, > NodeIndexInfo]())(node.id) > = new NodeIndexInfo(numNodesInGroup, featureSubset) > } > numNodesInGroup += 1 // we do not add the node to mutableNodesForGroup, > but we still add its memUsage here. > memUsage += nodeMemUsage > } > if (memUsage > maxMemoryUsage) { > // If maxMemoryUsage is 0, we should still allow splitting 1 node. > logWarning(s"Tree learning is using approximately $memUsage bytes per > iteration, which" + > s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. This > allows splitting" + > s" $numNodesInGroup nodes in this iteration.") > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
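The effect of the loop quoted in the issue is easy to reproduce in a stripped-down model. This Python sketch (a simplified re-creation of the control flow, not Spark's actual code; the node costs and limit are made up) shows why the warning fires on an ordinary iteration: the group counter and the memory total are updated even for a node that did not fit and was never added to the group.

```python
# Minimal re-creation of the RandomForest node-grouping loop's control
# flow: num_in_group and mem_usage advance even when a node is skipped.

def plan_group(node_costs, max_mem):
    group, mem_usage, num_in_group = [], 0, 0
    stack = list(node_costs)               # treat the list as a stack
    while stack and (mem_usage < max_mem or mem_usage == 0):
        cost = stack[-1]
        if mem_usage + cost <= max_mem or mem_usage == 0:
            stack.pop()
            group.append(cost)             # node actually joins the group
        num_in_group += 1                  # incremented even if skipped
        mem_usage += cost                  # counted even if skipped
    return group, num_in_group, mem_usage

group, reported, mem = plan_group([100, 100, 100], max_mem=250)
# Only two nodes fit, yet the loop reports 3 nodes and mem 300 > 250,
# so the "exceeds requested limit" warning fires on this iteration.
```

Here only two nodes were actually grouped, but the reported count is 3 and the reported usage exceeds the limit, matching the inaccurate warning described above.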
[jira] [Assigned] (SPARK-21638) Warning message of RF is not accurate
[ https://issues.apache.org/jira/browse/SPARK-21638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21638: - Assignee: Peng Meng > Warning message of RF is not accurate > - > > Key: SPARK-21638 > URL: https://issues.apache.org/jira/browse/SPARK-21638 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 > Environment: >Reporter: Peng Meng >Assignee: Peng Meng >Priority: Minor > Fix For: 2.3.0 > > > When training an RF model, there are many warning messages like this: > {quote}WARN RandomForest: Tree learning is using approximately 268492800 > bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. > This allows splitting 2622 nodes in this iteration.{quote} > This warning message is unnecessary and its numbers are not accurate. > Actually, this warning is shown whenever not all of the nodes can be split in one iteration. In most cases, not all of the nodes can be split in a single iteration, so in most cases this warning is shown on every iteration. > This is because: > {code:java} > while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) { > val (treeIndex, node) = nodeStack.top > // Choose subset of features for node (if subsampling). > val featureSubset: Option[Array[Int]] = if > (metadata.subsamplingFeatures) { > Some(SamplingUtils.reservoirSampleAndCount(Range(0, > metadata.numFeatures).iterator, metadata.numFeaturesPerNode, > rng.nextLong())._1) > } else { > None > } > // Check if enough memory remains to add this node to the group. 
> val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, > featureSubset) * 8L > if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) { > nodeStack.pop() > mutableNodesForGroup.getOrElseUpdate(treeIndex, new > mutable.ArrayBuffer[LearningNode]()) += > node > mutableTreeToNodeToIndexInfo > .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, > NodeIndexInfo]())(node.id) > = new NodeIndexInfo(numNodesInGroup, featureSubset) > } > numNodesInGroup += 1 // we do not add the node to mutableNodesForGroup, > but we still add its memUsage here. > memUsage += nodeMemUsage > } > if (memUsage > maxMemoryUsage) { > // If maxMemoryUsage is 0, we should still allow splitting 1 node. > logWarning(s"Tree learning is using approximately $memUsage bytes per > iteration, which" + > s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. This > allows splitting" + > s" $numNodesInGroup nodes in this iteration.") > } > {code}
[jira] [Commented] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UDF from HDFS
[ https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122234#comment-16122234 ] Weiqing Yang commented on SPARK-21697: -- Thanks for filing this issue! > NPE & ExceptionInInitializerError trying to load UDF from HDFS > -- > > Key: SPARK-21697 > URL: https://issues.apache.org/jira/browse/SPARK-21697 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: Spark Client mode, Hadoop 2.6.0 >Reporter: Steve Loughran >Priority: Minor > > Reported on [the > PR|https://github.com/apache/spark/pull/17342#issuecomment-321438157] for > SPARK-12868: trying to load a UDF off HDFS is triggering an > {{ExceptionInInitializerError}}, caused by an NPE which should only happen if > the commons-logging {{LOG}} log is null. > Hypothesis: the commons-logging scan for {{commons-logging.properties}} is > happening in the classpath with the HDFS JAR; this is triggering a download of the > JAR, which needs to force in commons-logging, and, as that's not inited yet, > NPEs
[jira] [Comment Edited] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UDF from HDFS
[ https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122195#comment-16122195 ] Steve Loughran edited comment on SPARK-21697 at 8/10/17 7:48 PM: - Text of the PR comment & stack bq. Have u tried it in yarn-client mode? i add this path in v2.1.1 + Hadoop 2.6.0, when i run "add jar" through SparkSQL CLI , it comes out this error: {code} ERROR thriftserver.SparkSQLDriver: Failed in [add jar hdfs://SunshineNameNode3:8020/lib/clouddata-common-lib/chardet-0.0.1.jar] java.lang.ExceptionInInitializerError at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:662) at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:889) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:947) at java.io.DataInputStream.read(DataInputStream.java:100) at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85) at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59) at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:369) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:341) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292) at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2107) at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2076) at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2052) at org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1274) at org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242) at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163) at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:632) at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:601) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:278) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:225) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:224) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:267) at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:601) at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:591) at org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:738) at org.apache.spark.sql.hive.HiveSessionState.addJar(HiveSessionState.scala:105) at org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:40) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) at org.apache.spark.sql.Dataset.(Dataset.scala:185) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) at 
org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at
[jira] [Commented] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UDF from HDFS
[ https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122195#comment-16122195 ] Steve Loughran commented on SPARK-21697: {code} Have u tried it in yarn-client mode? i add this path in v2.1.1 + Hadoop 2.6.0, when i run "add jar" through SparkSQL CLI , it comes out this error: ERROR thriftserver.SparkSQLDriver: Failed in [add jar hdfs://SunshineNameNode3:8020/lib/clouddata-common-lib/chardet-0.0.1.jar] java.lang.ExceptionInInitializerError at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:662) at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:889) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:947) at java.io.DataInputStream.read(DataInputStream.java:100) at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85) at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59) at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:369) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:341) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292) at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2107) at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2076) at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2052) at org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1274) at org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242) at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163) at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149) at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:632) at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:601) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:278) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:225) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:224) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:267) at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:601) at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:591) at org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:738) at org.apache.spark.sql.hive.HiveSessionState.addJar(HiveSessionState.scala:105) at org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:40) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) at org.apache.spark.sql.Dataset.(Dataset.scala:185) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) at 
org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743) at
[jira] [Created] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UDF from HDFS
Steve Loughran created SPARK-21697: -- Summary: NPE & ExceptionInInitializerError trying to load UDF from HDFS Key: SPARK-21697 URL: https://issues.apache.org/jira/browse/SPARK-21697 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1 Environment: Spark Client mode, Hadoop 2.6.0 Reporter: Steve Loughran Priority: Minor Reported on [the PR|https://github.com/apache/spark/pull/17342#issuecomment-321438157] for SPARK-12868: trying to load a UDF off HDFS is triggering an {{ExceptionInInitializerError}}, caused by an NPE which should only happen if the commons-logging {{LOG}} log is null. Hypothesis: the commons-logging scan for {{commons-logging.properties}} is happening in the classpath with the HDFS JAR; this is triggering a download of the JAR, which needs to force in commons-logging, and, as that's not inited yet, NPEs
[jira] [Resolved] (SPARK-21669) Internal API for collecting metrics/stats during FileFormatWriter jobs
[ https://issues.apache.org/jira/browse/SPARK-21669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-21669. - Resolution: Fixed Assignee: Adrian Ionescu Fix Version/s: 2.3.0 > Internal API for collecting metrics/stats during FileFormatWriter jobs > -- > > Key: SPARK-21669 > URL: https://issues.apache.org/jira/browse/SPARK-21669 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Adrian Ionescu >Assignee: Adrian Ionescu > Fix For: 2.3.0 > > > It would be useful to have some infrastructure in place for collecting custom > metrics or statistics on data on the fly, as it is being written to disk. > This was inspired by the work in SPARK-20703, which added simple metrics > collection for data write operations, such as {{numFiles}}, > {{numPartitions}}, {{numRows}}. Those metrics are first collected on the > executors and then sent to the driver, which aggregates and posts them as > updates to the {{SQLMetrics}} subsystem. > The above can be generalized and turned into a pluggable interface, which in > the future could be used for other purposes: e.g. automatic maintenance of > cost-based optimizer (CBO) statistics during "INSERT INTO SELECT ..." > operations, such that users won't need to explicitly call "ANALYZE TABLE > COMPUTE STATISTICS" afterwards anymore, thus avoiding an extra > full-table scan.
[jira] [Commented] (SPARK-21696) State Store can't handle corrupted snapshots
[ https://issues.apache.org/jira/browse/SPARK-21696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122078#comment-16122078 ] Alexander Bessonov commented on SPARK-21696: {{HDFSBackedStateStoreProvider.doMaintenance()}} will suppress any {{NonFatal}} exceptions. {{StateStore.startMaintenanceIfNeeded()}} won't restart the maintenance task if it has crashed. The State Store can still function even when a snapshot file is corrupted, by simply falling back to the deltas. > State Store can't handle corrupted snapshots > > > Key: SPARK-21696 > URL: https://issues.apache.org/jira/browse/SPARK-21696 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0 >Reporter: Alexander Bessonov >Priority: Critical > > State store's asynchronous maintenance task (generation of Snapshot files) is > not rescheduled if it crashes, which might lead to corrupted snapshots. > In our case, on multiple occasions, executors died during the maintenance task > with an Out Of Memory error, which led to the following error on recovery: > {code:none} > 17/08/07 20:12:24 WARN TaskSetManager: Lost task 3.1 in stage 102.0 (TID > 3314, dnj2-bach-r2n10.bloomberg.com, executor 94): java.io.EOFException > at java.io.DataInputStream.readInt(DataInputStream.java:392) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:436) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:314) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:313) > at 
scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:313) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:220) > at > org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:186) > at > org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at >
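The failure mode described in this issue is a classic fixed-rate-scheduler pitfall: on the JVM, a task submitted with {{ScheduledExecutorService.scheduleAtFixedRate}} is silently never run again once it terminates with an uncaught exception. A rough Python analogue (illustrative only; this is a plain loop rather than a real scheduler, and the names are mine):

```python
# Model of a periodic maintenance task: if an exception escapes the task
# and nothing restarts the schedule, every subsequent tick is lost.

def run_periodic(task, ticks, swallow_errors):
    runs = 0
    for _ in range(ticks):
        try:
            task()
            runs += 1
        except Exception:
            if not swallow_errors:
                break              # like the JVM: the schedule silently dies
    return runs

calls = {"n": 0}
def flaky_maintenance():
    calls["n"] += 1
    if calls["n"] == 2:
        raise RuntimeError("simulated OOM during snapshot generation")

calls["n"] = 0
fragile = run_periodic(flaky_maintenance, ticks=5, swallow_errors=False)
calls["n"] = 0
robust = run_periodic(flaky_maintenance, ticks=5, swallow_errors=True)
# fragile == 1: maintenance never runs again after the crash
# robust  == 4: the failed tick is skipped but the schedule survives
```

Catching and logging the exception inside the task, rather than letting it escape, is the standard way to keep such a maintenance schedule alive.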
[jira] [Updated] (SPARK-21696) State Store can't handle corrupted snapshots
[ https://issues.apache.org/jira/browse/SPARK-21696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Bessonov updated SPARK-21696: --- Description: State store's asynchronous maintenance task (generation of Snapshot files) is not rescheduled if it crashes, which might lead to corrupted snapshots. In our case, on multiple occasions, executors died during the maintenance task with an Out Of Memory error, which led to the following error on recovery: {code:none} 17/08/07 20:12:24 WARN TaskSetManager: Lost task 3.1 in stage 102.0 (TID 3314, dnj2-bach-r2n10.bloomberg.com, executor 94): java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:436) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:314) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:313) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:313) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:220) at org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:186) at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at 
java.lang.Thread.run(Thread.java:745) {code} was: State store's asynchronous maintenance task (generation of Snapshot files) is not rescheduled if crashed which might lead to corrupted snapshots. In our case, on multiple occasions, executors died during maintenance task with Out Of Memory error which led to following error on recovery: {code:text} 17/08/07 20:12:24 WARN TaskSetManager: Lost task 3.1 in stage 102.0 (TID 3314, dnj2-bach-r2n10.bloomberg.com, executor 94): java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at
[jira] [Created] (SPARK-21696) State Store can't handle corrupted snapshots
Alexander Bessonov created SPARK-21696: -- Summary: State Store can't handle corrupted snapshots Key: SPARK-21696 URL: https://issues.apache.org/jira/browse/SPARK-21696 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.2.0, 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0 Reporter: Alexander Bessonov Priority: Critical State store's asynchronous maintenance task (generation of Snapshot files) is not rescheduled if it crashes, which might lead to corrupted snapshots. In our case, on multiple occasions, executors died during the maintenance task with an Out Of Memory error, which led to the following error on recovery: {code:text} 17/08/07 20:12:24 WARN TaskSetManager: Lost task 3.1 in stage 102.0 (TID 3314, dnj2-bach-r2n10.bloomberg.com, executor 94): java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:436) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:314) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:313) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:313) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:220) at org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:186) at 
org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code}
[jira] [Commented] (SPARK-17419) Mesos virtual network support
[ https://issues.apache.org/jira/browse/SPARK-17419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122019#comment-16122019 ] Susan X. Huynh commented on SPARK-17419: SPARK-21694 allows the user to pass network labels to CNI plugins. > Mesos virtual network support > - > > Key: SPARK-17419 > URL: https://issues.apache.org/jira/browse/SPARK-17419 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Michael Gummelt > > http://mesos.apache.org/documentation/latest/cni/ > This will enable launching executors into virtual networks for isolation and > security. It will also enable one container per IP.
[jira] [Commented] (SPARK-17419) Mesos virtual network support
[ https://issues.apache.org/jira/browse/SPARK-17419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122018#comment-16122018 ] Susan X. Huynh commented on SPARK-17419: SPARK-18232 adds the ability to launch containers attached to a CNI network, by specifying `--conf spark.mesos.network.name`. > Mesos virtual network support > - > > Key: SPARK-17419 > URL: https://issues.apache.org/jira/browse/SPARK-17419 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Michael Gummelt > > http://mesos.apache.org/documentation/latest/cni/ > This will enable launching executors into virtual networks for isolation and > security. It will also enable container per IP.
[jira] [Created] (SPARK-21695) Spark scheduler locality algorithm can take longer than expected
Thomas Graves created SPARK-21695: - Summary: Spark scheduler locality algorithm can take longer than expected Key: SPARK-21695 URL: https://issues.apache.org/jira/browse/SPARK-21695 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.1.0 Reporter: Thomas Graves Reference jira https://issues.apache.org/jira/browse/SPARK-21656 I'm seeing an issue with some jobs where the scheduler takes a long time to schedule tasks on executors. The default locality wait is 3 seconds, so I was expecting an executor to get some task within at most 9 seconds (node local, rack local, any), but it is taking far longer than that. In the case of SPARK-21656 it takes 60+ seconds and executors hit the idle timeout. We should investigate why and see if we can fix this. Upon an initial look, it seems the scheduler resets the locality lastLaunchTime whenever it places any task on a node at that locality level. It appears this means it can take far longer than 3 seconds for any particular task to fall back, but this needs to be verified.
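The suspected reset behavior can be modeled with a toy sketch (plain Python with hypothetical names, not Spark's actual TaskSetManager API): if launching any task at a locality level resets that level's timer, the fallback clock keeps restarting and a pending task can wait far longer than the nominal 3-second locality wait.

```python
import time

class LocalityWaitTracker:
    """Toy model of delay scheduling; names are illustrative only.

    The reported issue is that launching *any* task at a locality level
    resets last_launch_time, so the fallback timeout restarts and other
    tasks at that level can be delayed well past the configured wait.
    """
    def __init__(self, locality_wait=3.0):
        self.locality_wait = locality_wait
        self.last_launch_time = time.monotonic()

    def record_launch(self):
        # Resetting on every launch is the suspected root cause:
        self.last_launch_time = time.monotonic()

    def should_fall_back(self, now=None):
        # Fall back to a less local level once the wait has elapsed
        # with no launches at this level.
        if now is None:
            now = time.monotonic()
        return now - self.last_launch_time > self.locality_wait
```

With injected timestamps: if the level last launched at t=0, fallback is due at t=4; but if any launch happens at t=2, the clock restarts and fallback at t=4 is suppressed, which matches the 60+ second delays described above.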
[jira] [Created] (SPARK-21694) Support Mesos CNI network labels
Susan X. Huynh created SPARK-21694: -- Summary: Support Mesos CNI network labels Key: SPARK-21694 URL: https://issues.apache.org/jira/browse/SPARK-21694 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 2.2.0 Reporter: Susan X. Huynh Fix For: 2.3.0 Background: SPARK-18232 added the ability to launch containers attached to a CNI network by specifying the network name via `spark.mesos.network.name`. This ticket is to allow the user to pass network labels to CNI plugins. More details in the related Mesos documentation: http://mesos.apache.org/documentation/latest/cni/#mesos-meta-data-to-cni-plugins
[jira] [Updated] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21688: -- Priority: Minor (was: Major) > performance improvement in mllib SVM with native BLAS > -- > > Key: SPARK-21688 > URL: https://issues.apache.org/jira/browse/SPARK-21688 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 > Environment: 4 nodes: 1 master node, 3 worker nodes > model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz > Memory : 180G > num of core per node: 10 >Reporter: Vincent >Priority: Minor > Attachments: ddot unitest.png, mllib svm training.png, > native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png > > > In the current mllib SVM implementation, we found that the CPU is not fully > utilized; one reason is that f2j blas is set to be used in the HingeGradient > computation. As we found out earlier > (https://issues.apache.org/jira/browse/SPARK-21305), with proper > settings native blas is generally better than f2j at the unit-test level. > Here we make the blas operations in SVM go with MKL blas and present an end-to-end > performance report showing that in most cases native blas outperforms > f2j blas by up to 50%. > So, we suggest removing the hard-coded f2j calls and going with native blas if > available. If this proposal is acceptable, we will move on to benchmark the other > algorithms impacted.
[jira] [Commented] (SPARK-21644) LocalLimit.maxRows is defined incorrectly
[ https://issues.apache.org/jira/browse/SPARK-21644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121946#comment-16121946 ] Xiao Li commented on SPARK-21644: - https://github.com/apache/spark/pull/18851 > LocalLimit.maxRows is defined incorrectly > - > > Key: SPARK-21644 > URL: https://issues.apache.org/jira/browse/SPARK-21644 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > {code} > case class LocalLimit(limitExpr: Expression, child: LogicalPlan) extends > UnaryNode { > override def output: Seq[Attribute] = child.output > override def maxRows: Option[Long] = { > limitExpr match { > case IntegerLiteral(limit) => Some(limit) > case _ => None > } > } > } > {code} > This is simply wrong, since LocalLimit is only about partition level limits.
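The bug report above can be illustrated with a minimal sketch (plain Python, purely illustrative, not Spark code): a local limit caps each partition independently, so the global row count can be as large as the number of partitions times the limit, and reporting the per-partition limit as the plan's global maxRows understates it.

```python
# Illustrative only: a "local" limit is applied to each partition
# independently, so the total output can exceed the limit.
def local_limit(partitions, limit):
    return [p[:limit] for p in partitions]

limited = local_limit([[1, 2, 3], [4, 5, 6]], 2)
total_rows = sum(len(p) for p in limited)  # 4 rows, not 2
```

Here two partitions each keep 2 rows, so the plan produces 4 rows overall; reporting maxRows = 2 for this plan, as the quoted definition effectively did, would be wrong.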
[jira] [Commented] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions
[ https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121931#comment-16121931 ] Chaoyu Tang commented on SPARK-14927: - [~rajeshc] could you provide your example here? > DataFrame. saveAsTable creates RDD partitions but not Hive partitions > - > > Key: SPARK-14927 > URL: https://issues.apache.org/jira/browse/SPARK-14927 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.1 > Environment: Mac OS X 10.11.4 local >Reporter: Sasha Ovsankin > > This is a followup to > http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive > . I tried to use suggestions in the answers but couldn't make it to work in > Spark 1.6.1 > I am trying to create partitions programmatically from `DataFrame. Here is > the relevant code (adapted from a Spark test): > hc.setConf("hive.metastore.warehouse.dir", "tmp/tests") > //hc.setConf("hive.exec.dynamic.partition", "true") > //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict") > hc.sql("create database if not exists tmp") > hc.sql("drop table if exists tmp.partitiontest1") > Seq(2012 -> "a").toDF("year", "val") > .write > .partitionBy("year") > .mode(SaveMode.Append) > .saveAsTable("tmp.partitiontest1") > hc.sql("show partitions tmp.partitiontest1").show > Full file is here: > https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a > I get the error that the table is not partitioned: > == > HIVE FAILURE OUTPUT > == > SET hive.support.sql11.reserved.keywords=false > SET hive.metastore.warehouse.dir=tmp/tests > OK > OK > FAILED: Execution Error, return code 1 from > org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a > partitioned table > == > It looks like the root cause is that > `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable` > always creates table with empty partitions. > Any help to move this forward is appreciated. 
[jira] [Resolved] (SPARK-18648) spark-shell --jars option does not add jars to classpath on windows
[ https://issues.apache.org/jira/browse/SPARK-18648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-18648. -- Resolution: Duplicate I checked that {{spark-shell --jars C:\test\my.jar}} now works, so this has been fixed. I am resolving this. > spark-shell --jars option does not add jars to classpath on windows > --- > > Key: SPARK-18648 > URL: https://issues.apache.org/jira/browse/SPARK-18648 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Windows >Affects Versions: 2.0.2 > Environment: Windows 7 x64 >Reporter: Michel Lemay > Labels: windows > > I can't import symbols from command line jars when in the shell: > Adding jars via --jars: > {code} > spark-shell --master local[*] --jars path\to\deeplearning4j-core-0.7.0.jar > {code} > Same result if I add it through maven coordinates: > {code}spark-shell --master local[*] --packages > org.deeplearning4j:deeplearning4j-core:0.7.0 > {code} > I end up with: > {code} > scala> import org.deeplearning4j > :23: error: object deeplearning4j is not a member of package org >import org.deeplearning4j > {code} > NOTE: It is working as expected when running on linux.
> Sample output with --verbose: > {code} > Using properties file: null > Parsed arguments: > master local[*] > deployMode null > executorMemory null > executorCores null > totalExecutorCores null > propertiesFile null > driverMemorynull > driverCores null > driverExtraClassPathnull > driverExtraLibraryPath null > driverExtraJavaOptions null > supervise false > queue null > numExecutorsnull > files null > pyFiles null > archivesnull > mainClass org.apache.spark.repl.Main > primaryResource spark-shell > nameSpark shell > childArgs [] > jars > file:/C:/Apps/Spark/spark-2.0.2-bin-hadoop2.4/bin/../deeplearning4j-core-0.7.0.jar > packagesnull > packagesExclusions null > repositoriesnull > verbose true > Spark properties used, including those specified through > --conf and those from the properties file null: > Main class: > org.apache.spark.repl.Main > Arguments: > System properties: > SPARK_SUBMIT -> true > spark.app.name -> Spark shell > spark.jars -> > file:/C:/Apps/Spark/spark-2.0.2-bin-hadoop2.4/bin/../deeplearning4j-core-0.7.0.jar > spark.submit.deployMode -> client > spark.master -> local[*] > Classpath elements: > file:/C:/Apps/Spark/spark-2.0.2-bin-hadoop2.4/bin/../deeplearning4j-core-0.7.0.jar > 16/11/30 08:30:49 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/11/30 08:30:51 WARN SparkContext: Use an existing SparkContext, some > configuration may not take effect. > Spark context Web UI available at http://192.168.70.164:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1480512651325). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.0.2 > /_/ > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> import org.deeplearning4j > :23: error: object deeplearning4j is not a member of package org >import org.deeplearning4j > ^ > scala> > {code}
[jira] [Commented] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
[ https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121888#comment-16121888 ] Hyukjin Kwon commented on SPARK-21693: -- Yes, it does build multiple times, and if I have observed this correctly, it won't particularly affect queuing, but it'd add roughly 25-30ish mins more to each build. Will check out other possible things too, and also try to measure how long each test in "MLlib classification algorithms" takes. > AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests > - > > Key: SPARK-21693 > URL: https://issues.apache.org/jira/browse/SPARK-21693 > Project: Spark > Issue Type: Test > Components: Build, SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon > > We finally sometimes reach the time limit, 1.5 hours, > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master > I requested to increase this from an hour to 1.5 hours before but it looks we > should fix this in AppVeyor. I asked this for my account few times before but > it looks we can't increase this time limit again and again. > I could identify two things that look taking a quite a bit of time: > 1. Disabled cache feature in pull request builder, which ends up downloading > Maven dependencies (10-20ish mins) > https://www.appveyor.com/docs/build-cache/ > {quote} > Note: Saving cache is disabled in Pull Request builds. > {quote} > and also see > http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working > This seems difficult to fix within Spark. > 2. "MLlib classification algorithms" tests (30-35ish mins) > This test below looks taking 30-35ish mins. > {code} > MLlib classification algorithms, except for tree-based algorithms: Spark > package found in SPARK_HOME: C:\projects\spark\bin\.. > ..
> {code} > As a (I think) last resort, we could make a matrix for this test alone, so > that we run the other tests after a build and then run this test after > another build, for example, I run Scala tests by this workaround - > https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix > with 7 build and test each). > I am also checking and testing other ways.
[jira] [Commented] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
[ https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121866#comment-16121866 ] Felix Cheung commented on SPARK-21693: -- Splitting the test matrix is also possible. I worry, though: since caching is disabled, isn't the Spark jar being built multiple times? My main concerns are how long tests will run and whether that will lengthen queuing of test runs (which can get quite long already, and people sometimes ignore pending AppVeyor runs) > AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests > - > > Key: SPARK-21693 > URL: https://issues.apache.org/jira/browse/SPARK-21693 > Project: Spark > Issue Type: Test > Components: Build, SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon > > We finally sometimes reach the time limit, 1.5 hours, > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master > I requested to increase this from an hour to 1.5 hours before but it looks we > should fix this in AppVeyor. I asked this for my account few times before but > it looks we can't increase this time limit again and again. > I could identify two things that look taking a quite a bit of time: > 1. Disabled cache feature in pull request builder, which ends up downloading > Maven dependencies (10-20ish mins) > https://www.appveyor.com/docs/build-cache/ > {quote} > Note: Saving cache is disabled in Pull Request builds. > {quote} > and also see > http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working > This seems difficult to fix within Spark. > 2. "MLlib classification algorithms" tests (30-35ish mins) > This test below looks taking 30-35ish mins. > {code} > MLlib classification algorithms, except for tree-based algorithms: Spark > package found in SPARK_HOME: C:\projects\spark\bin\.. > ..
> {code} > As a (I think) last resort, we could make a matrix for this test alone, so > that we run the other tests after a build and then run this test after > another build, for example, I run Scala tests by this workaround - > https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix > with 7 build and test each). > I am also checking and testing other ways.
[jira] [Commented] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
[ https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121860#comment-16121860 ] Felix Cheung commented on SPARK-21693: -- We could certainly simplify the classification set, but there's a fair number of APIs being tested in there; perhaps we could time them to see which ones are taking the longest. > AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests > - > > Key: SPARK-21693 > URL: https://issues.apache.org/jira/browse/SPARK-21693 > Project: Spark > Issue Type: Test > Components: Build, SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon > > We finally sometimes reach the time limit, 1.5 hours, > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master > I requested to increase this from an hour to 1.5 hours before but it looks we > should fix this in AppVeyor. I asked this for my account few times before but > it looks we can't increase this time limit again and again. > I could identify two things that look taking a quite a bit of time: > 1. Disabled cache feature in pull request builder, which ends up downloading > Maven dependencies (10-20ish mins) > https://www.appveyor.com/docs/build-cache/ > {quote} > Note: Saving cache is disabled in Pull Request builds. > {quote} > and also see > http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working > This seems difficult to fix within Spark. > 2. "MLlib classification algorithms" tests (30-35ish mins) > This test below looks taking 30-35ish mins. > {code} > MLlib classification algorithms, except for tree-based algorithms: Spark > package found in SPARK_HOME: C:\projects\spark\bin\.. > ..
> {code} > As a (I think) last resort, we could make a matrix for this test alone, so > that we run the other tests after a build and then run this test after > another build, for example, I run Scala tests by this workaround - > https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix > with 7 build and test each). > I am also checking and testing other ways.
[jira] [Resolved] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang resolved SPARK-21535. Resolution: Not A Problem The new implementation would load the evaluation dataset while training the model and may not always yield better performance. Please refer to the discussion in the PR. > Reduce memory requirement for CrossValidator and TrainValidationSplit > -- > > Key: SPARK-21535 > URL: https://issues.apache.org/jira/browse/SPARK-21535 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang > > CrossValidator and TrainValidationSplit both use > {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where > epm is Array[ParamMap]. > Even though the training process is sequential, the current implementation > consumes extra driver memory for holding the trained models, which is not > necessary and often leads to memory exceptions for both CrossValidator and > TrainValidationSplit. My proposal is to optimize the training implementation > so that a used model can be collected by GC, avoiding the unnecessary OOM > exceptions. > E.g. when the grid search space is 12, the old implementation needs to hold all 12 > trained models in driver memory at the same time, while the new > implementation only needs to hold 1 trained model at a time, and the previous > model can be cleared by GC.
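The proposed optimization (ultimately declined here, per the resolution) can be sketched as follows. This is a plain-Python toy model with hypothetical fit/evaluate callables, not the actual CrossValidator API: by fitting and scoring one ParamMap at a time and keeping only the best metric, at most one trained model is reachable at any moment, so earlier models become garbage-collectable instead of all being held in driver memory.

```python
def select_best(fit, evaluate, train, param_maps):
    """Fit one model at a time; keep only the best params and metric.

    Hypothetical callables: fit(train, params) -> model and
    evaluate(model) -> float. With a grid of 12 ParamMaps, at most one
    trained model is referenced at any moment instead of all 12.
    """
    best_params, best_metric = None, float("-inf")
    for params in param_maps:
        model = fit(train, params)   # the only reference to this model
        metric = evaluate(model)
        if metric > best_metric:
            best_params, best_metric = params, metric
        # `model` goes out of scope on the next iteration, so the
        # previous model is eligible for garbage collection
    return best_params, best_metric
```

The trade-off noted in the resolution is visible in the loop: each iteration must evaluate immediately after fitting, which forces the evaluation dataset to be loaded during training rather than once after all fits.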
[jira] [Updated] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
[ https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-21693: - Description: We finally sometimes reach the time limit, 1.5 hours, https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master I requested to increase this from an hour to 1.5 hours before but it looks we should fix this in AppVeyor. I asked this for my account few times before but it looks we can't increase this time limit again and again. I could identify two things that look taking a quite a bit of time: 1. Disabled cache feature in pull request builder, which ends up downloading Maven dependencies (10-20ish mins) https://www.appveyor.com/docs/build-cache/ {quote} Note: Saving cache is disabled in Pull Request builds. {quote} and also see http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working This seems difficult to fix within Spark. 2. "MLlib classification algorithms" tests (30-35ish mins) This test below looks taking 30-35ish mins. {code} MLlib classification algorithms, except for tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. .. {code} As a (I think) last resort, we could make a matrix for this test alone, so that we run the other tests after a build and then run this test after another build, for example, I run Scala tests by this workaround - https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix with 7 build and test each). I am also checking and testing other ways. was: We finally sometimes reach the time limit, 1.5 hours, https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master I requested to increase this from an hour to 1.5 hours before but it looks we should fix this in AppVeyor. I asked this for my account few times before but it looks we can't increase this time limit again and again. I could identify three things that look taking a quite a bit of time: 1. 
Disabled cache feature in pull request builder, which ends up downloading Maven dependencies (10-20ish mins) https://www.appveyor.com/docs/build-cache/ {quote} Note: Saving cache is disabled in Pull Request builds. {quote} and also see http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working This seems difficult to fix within Spark. 2. "MLlib classification algorithms" tests (30-35ish mins) This test below looks taking 30-35ish mins. {code} MLlib classification algorithms, except for tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. .. {code} As a (I think) last resort, we could make a matrix for this test alone, so that we run the other tests after a build and then run this test after another build, for example, I run Scala tests by this workaround - https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix with 7 build and test each). 3. Disabled {{spark.sparkr.use.daemon}} on Windows due to the limitation of {{mcfork}} See [this codes|https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L362-L392]. We disabled this feature and currently fork processes from Java that is expensive. I haven't tested this yet but maybe reducing {{spark.sql.shuffle.partitions}} can be an approach to work around this. Currently, if I understood correctly, this is 200 by default in R tests, which ends up with 200 Java processes for every shuffle. 
> AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests > - > > Key: SPARK-21693 > URL: https://issues.apache.org/jira/browse/SPARK-21693 > Project: Spark > Issue Type: Test > Components: Build, SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon > > We finally sometimes reach the time limit, 1.5 hours, > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master > I requested to increase this from an hour to 1.5 hours before but it looks we > should fix this in AppVeyor. I asked this for my account few times before but > it looks we can't increase this time limit again and again. > I could identify two things that look taking a quite a bit of time: > 1. Disabled cache feature in pull request builder, which ends up downloading > Maven dependencies (10-20ish mins) > https://www.appveyor.com/docs/build-cache/ > {quote} > Note: Saving cache is disabled in Pull Request builds. > {quote} > and also see > http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working > This seems difficult to fix within Spark. > 2.
[jira] [Updated] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
[ https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-21693: - Description: We finally sometimes reach the time limit, 1.5 hours, https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master I requested to increase this from an hour to 1.5 hours before but it looks we should fix this in AppVeyor. I asked this for my account few times before but it looks we can't increase this time limit again and again. I could identify three things that look taking a quite a bit of time: 1. Disabled cache feature in pull request builder, which ends up downloading Maven dependencies (10-20ish mins) https://www.appveyor.com/docs/build-cache/ {quote} Note: Saving cache is disabled in Pull Request builds. {quote} and also see http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working This seems difficult to fix within Spark. 2. "MLlib classification algorithms" tests (30-35ish mins) This test below looks taking 30-35ish mins. {code} MLlib classification algorithms, except for tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. .. {code} As a (I think) last resort, we could make a matrix for this test alone, so that we run the other tests after a build and then run this test after another build, for example, I run Scala tests by this workaround - https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix with 7 build and test each). 3. Disabled {{spark.sparkr.use.daemon}} on Windows due to the limitation of {{mcfork}} See [this codes|https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L362-L392]. We disabled this feature and currently fork processes from Java that is expensive. I haven't tested this yet but maybe reducing {{spark.sql.shuffle.partitions}} can be an approach to work around this. 
Currently, if I understood correctly, this is 200 by default in R tests, which ends up with 200 Java processes for every shuffle. was: We finally sometimes reach the time limit, 1.5 hours, https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master I requested to increase this from an hour to 1.5 hours before but it looks we should fix this in AppVeyor. I asked this for my account few times before but it looks we can't increase this time limit again and again. I could identify three things that look taking a quite a bit of times: 1. Disabled cache feature in pull request builder, which ends up downloading Maven dependencies (10-20ish mins) https://www.appveyor.com/docs/build-cache/ {quote} Note: Saving cache is disabled in Pull Request builds. {quote} and also see http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working This seems difficult to fix within Spark. 2. "MLlib classification algorithms" tests (30-35ish mins) This test below looks taking 30-35ish mins. {code} MLlib classification algorithms, except for tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. .. {code} As a (I think) last resort, we could make a matrix for this test alone, so that we run the other tests after a build and then run this test after another build, for example, I run Scala tests by this workaround - https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix with 7 build and test each). 3. Disabled {{spark.sparkr.use.daemon}} on Windows due to the limitation of {{mcfork}} See [this codes|https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L362-L392]. We disabled this feature and currently fork processes from Java that is expensive. I haven't tested this yet but maybe reducing {{spark.sql.shuffle.partitions}} can be an approach to work around this. 
Currently, if I understood correctly, this is 200 by default in R tests, which ends up with 200 Java processes for every shuffle. > AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests > - > > Key: SPARK-21693 > URL: https://issues.apache.org/jira/browse/SPARK-21693 > Project: Spark > Issue Type: Test > Components: Build, SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon > > We finally sometimes reach the time limit, 1.5 hours, > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master > I requested to increase this from an hour to 1.5 hours before but it looks we > should fix this in AppVeyor. I asked this for my
[jira] [Updated] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
[ https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-21693: - Description: We finally sometimes reach the time limit, 1.5 hours, https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master I requested to increase this from an hour to 1.5 hours before but it looks we should fix this in AppVeyor. I asked this for my account few times before but it looks we can't increase this time limit again and again. I could identify three things that look taking a quite a bit of times: 1. Disabled cache feature in pull request builder, which ends up downloading Maven dependencies (10-20ish mins) https://www.appveyor.com/docs/build-cache/ {quote} Note: Saving cache is disabled in Pull Request builds. {quote} and also see http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working This seems difficult to fix within Spark. 2. "MLlib classification algorithms" tests (30-35ish mins) This test below looks taking 30-35ish mins. {code} MLlib classification algorithms, except for tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. .. {code} As a (I think) last resort, we could make a matrix for this test alone, so that we run the other tests after a build and then run this test after another build, for example, I run Scala tests by this workaround - https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix with 7 build and test each). 3. Disabled {{spark.sparkr.use.daemon}} on Windows due to the limitation of {{mcfork}} See [this codes|https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L362-L392]. We disabled this feature and currently fork processes from Java that is expensive. I haven't tested this yet but maybe reducing {{spark.sql.shuffle.partitions}} can be an approach to work around this. 
Currently, if I understood correctly, this is 200 by default in R tests, which ends up with 200 Java processes for every shuffle. was: We finally sometimes reach the time limit, 1.5 hours, https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master I requested to increase this from an hour to 1.5 hours before but it looks we should fix this in AppVeyor. I asked this for my account few times before but it looks we can't increase this time limit again and again. I could identify three things that take a quite a bit of times: 1. Disabled cache feature in pull request builder, which ends up downloading Maven dependencies (15-20ish mins) https://www.appveyor.com/docs/build-cache/ {quote} Note: Saving cache is disabled in Pull Request builds. {quote} and also see http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working This seems difficult to fix within Spark. 2. "MLlib classification algorithms" tests (30-35ish mins) This test below looks taking 30-35ish mins. {code} MLlib classification algorithms, except for tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. .. {code} As a (I think) last resort, we could make a matrix for this test alone, so that we run the other tests after a build and then run this test after another build, for example, I run Scala tests by this workaround - https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix with 7 build and test each). 3. Disabled {{spark.sparkr.use.daemon}} on Windows due to the limitation of {{mcfork}} See [this codes|https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L362-L392]. We disabled this feature and currently fork processes from Java that is expensive. I haven't tested this yet but maybe reducing {{spark.sql.shuffle.partitions}} can be an approach to work around this. 
Currently, if I understood correctly, this is 200 by default in R tests, which ends up with 200 Java processes for every shuffle. > AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests > - > > Key: SPARK-21693 > URL: https://issues.apache.org/jira/browse/SPARK-21693 > Project: Spark > Issue Type: Test > Components: Build, SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon > > We finally sometimes reach the time limit, 1.5 hours, > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master > I requested to increase this from an hour to 1.5 hours before but it looks we > should fix this in AppVeyor. I asked this for my account few
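The workaround in point 3 above can be sketched roughly as follows. This is a hypothetical illustration, not the actual SparkR test harness: the idea is simply to lower {{spark.sql.shuffle.partitions}} for the R test run so that each shuffle forks far fewer Java worker processes on Windows, where {{spark.sparkr.use.daemon}} is disabled.

```python
# Hypothetical sketch of the proposed workaround: cap the shuffle
# parallelism used by the R test run. The function name and the value 8
# are illustrative; only the config key and its default of 200 come
# from the discussion above.
SPARK_DEFAULT_SHUFFLE_PARTITIONS = 200  # Spark's default for this key

def r_test_spark_conf(partitions=8):
    """Extra Spark conf the R test run could pass to cut per-shuffle forks."""
    if not 0 < partitions <= SPARK_DEFAULT_SHUFFLE_PARTITIONS:
        raise ValueError("partitions must be between 1 and the default")
    return {"spark.sql.shuffle.partitions": str(partitions)}
```

With the default of 200, every shuffle in the R tests forks 200 processes; an override like this would fork only a handful per shuffle.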
[jira] [Commented] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
[ https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121827#comment-16121827 ] Hyukjin Kwon commented on SPARK-21693: -- FYI, [~felixcheung] and [~shivaram]. > AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests > - > > Key: SPARK-21693 > URL: https://issues.apache.org/jira/browse/SPARK-21693 > Project: Spark > Issue Type: Test > Components: Build, SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon > > We finally sometimes reach the time limit, 1.5 hours, > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master > I requested to increase this from an hour to 1.5 hours before but it looks we > should fix this in AppVeyor. I asked this for my account few times before but > it looks we can't increase this time limit again and again. > I could identify three things that take a quite a bit of times: > 1. Disabled cache feature in pull request builder, which ends up downloading > Maven dependencies (15-20ish mins) > https://www.appveyor.com/docs/build-cache/ > {quote} > Note: Saving cache is disabled in Pull Request builds. > {quote} > and also see > http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working > This seems difficult to fix within Spark. > 2. "MLlib classification algorithms" tests (30-35ish mins) > This test below looks taking 30-35ish mins. > {code} > MLlib classification algorithms, except for tree-based algorithms: Spark > package found in SPARK_HOME: C:\projects\spark\bin\.. > .. > {code} > As a (I think) last resort, we could make a matrix for this test alone, so > that we run the other tests after a build and then run this test after > another build, for example, I run Scala tests by this workaround - > https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix > with 7 build and test each). > 3. 
Disabled {{spark.sparkr.use.daemon}} on Windows due to the limitation of > {{mcfork}} > See [this > codes|https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L362-L392]. > We disabled this feature and currently fork processes from Java that is > expensive. I haven't tested this yet but maybe reducing > {{spark.sql.shuffle.partitions}} can be an approach to work around this. > Currently, if I understood correctly, this is 200 by default in R tests, > which ends up with 200 Java processes for every shuffle. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
[ https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-21693: - Description: We finally sometimes reach the time limit, 1.5 hours, https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master I requested to increase this from an hour to 1.5 hours before but it looks we should fix this in AppVeyor. I asked this for my account few times before but it looks we can't increase this time limit again and again. I could identify three things that take a quite a bit of times: 1. Disabled cache feature in pull request builder, which ends up downloading Maven dependencies (15-20ish mins) https://www.appveyor.com/docs/build-cache/ {quote} Note: Saving cache is disabled in Pull Request builds. {quote} and also see http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working This seems difficult to fix within Spark. 2. "MLlib classification algorithms" tests (30-35ish mins) This test below looks taking 30-35ish mins. {code} MLlib classification algorithms, except for tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. .. {code} As a (I think) last resort, we could make a matrix for this test alone, so that we run the other tests after a build and then run this test after another build, for example, I run Scala tests by this workaround - https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix with 7 build and test each). 3. Disabled {{spark.sparkr.use.daemon}} on Windows due to the limitation of {{mcfork}} See [this codes|https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L362-L392]. We disabled this feature and currently fork processes from Java that is expensive. I haven't tested this yet but maybe reducing {{spark.sql.shuffle.partitions}} can be an approach to work around this. 
Currently, if I understood correctly, this is 200 by default in R tests, which ends up with 200 Java processes for every shuffle. was: We finally sometimes reach the time limit, 1.5 hours, https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master I requested to increase this from an hour to 1.5 hours before but it looks we should fix this in AppVeyor. I asked this for my account few times before but it looks we can't increase this time limit again and again. I could identify three things that take a quite a bit of times: 1. Disabled cache feature in pull request builder, which ends up downloading Maven dependencies (15-20ish mins) https://www.appveyor.com/docs/build-cache/ {code} Note: Saving cache is disabled in Pull Request builds. {code} and also see http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working This seems difficult to fix within Spark. 2. "MLlib classification algorithms" tests (30-35ish mins) This test below looks taking 30-35ish mins. {code} MLlib classification algorithms, except for tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. .. {code} As a (I think) last resort, we could make a matrix for this test alone, so that we run the other tests after a build and then run this test after another build, for example, I run Scala tests by this workaround - https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix with 7 build and test each). 3. Disabled {{spark.sparkr.use.daemon}} on Windows due to the limitation of {{mcfork}} See [this codes|https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L362-L392]. We disabled this feature and currently fork processes from Java that is expensive. I haven't tested this yet but maybe reducing {{spark.sql.shuffle.partitions}} can be an approach to work around this. 
Currently, if I understood correctly, this is 200 by default in R tests, which ends up with 200 Java processes for every shuffle. > AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests > - > > Key: SPARK-21693 > URL: https://issues.apache.org/jira/browse/SPARK-21693 > Project: Spark > Issue Type: Test > Components: Build, SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon > > We finally sometimes reach the time limit, 1.5 hours, > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master > I requested to increase this from an hour to 1.5 hours before but it looks we > should fix this in AppVeyor. I asked this for my account few times
[jira] [Created] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
Hyukjin Kwon created SPARK-21693: Summary: AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests Key: SPARK-21693 URL: https://issues.apache.org/jira/browse/SPARK-21693 Project: Spark Issue Type: Test Components: Build, SparkR Affects Versions: 2.3.0 Reporter: Hyukjin Kwon We finally sometimes reach the time limit, 1.5 hours, https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master I requested to increase this from an hour to 1.5 hours before but it looks we should fix this in AppVeyor. I asked this for my account few times before but it looks we can't increase this time limit again and again. I could identify three things that take a quite a bit of times: 1. Disabled cache feature in pull request builder, which ends up downloading Maven dependencies (15-20ish mins) https://www.appveyor.com/docs/build-cache/ {code} Note: Saving cache is disabled in Pull Request builds. {code} and also see http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working This seems difficult to fix within Spark. 2. "MLlib classification algorithms" tests (30-35ish mins) This test below looks taking 30-35ish mins. {code} MLlib classification algorithms, except for tree-based algorithms: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. .. {code} As a (I think) last resort, we could make a matrix for this test alone, so that we run the other tests after a build and then run this test after another build, for example, I run Scala tests by this workaround - https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix with 7 build and test each). 3. Disabled {{spark.sparkr.use.daemon}} on Windows due to the limitation of {{mcfork}} See [this codes|https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L362-L392]. We disabled this feature and currently fork processes from Java that is expensive. 
I haven't tested this yet but maybe reducing {{spark.sql.shuffle.partitions}} can be an approach to work around this. Currently, if I understood correctly, this is 200 by default in R tests, which ends up with 200 Java processes for every shuffle. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18648) spark-shell --jars option does not add jars to classpath on windows
[ https://issues.apache.org/jira/browse/SPARK-18648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121817#comment-16121817 ] Devaraj K commented on SPARK-18648: --- [~FlamingMike], This has been fixed as part of SPARK-21339; can you check this issue with the SPARK-21339 change if you have a chance? Thanks > spark-shell --jars option does not add jars to classpath on windows > --- > > Key: SPARK-18648 > URL: https://issues.apache.org/jira/browse/SPARK-18648 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Windows >Affects Versions: 2.0.2 > Environment: Windows 7 x64 >Reporter: Michel Lemay > Labels: windows > > I can't import symbols from command line jars when in the shell: > Adding jars via --jars: > {code} > spark-shell --master local[*] --jars path\to\deeplearning4j-core-0.7.0.jar > {code} > Same result if I add it through maven coordinates: > {code}spark-shell --master local[*] --packages > org.deeplearning4j:deeplearning4j-core:0.7.0 > {code} > I end up with: > {code} > scala> import org.deeplearning4j > :23: error: object deeplearning4j is not a member of package org >import org.deeplearning4j > {code} > NOTE: It is working as expected when running on linux. 
> Sample output with --verbose: > {code} > Using properties file: null > Parsed arguments: > master local[*] > deployMode null > executorMemory null > executorCores null > totalExecutorCores null > propertiesFile null > driverMemorynull > driverCores null > driverExtraClassPathnull > driverExtraLibraryPath null > driverExtraJavaOptions null > supervise false > queue null > numExecutorsnull > files null > pyFiles null > archivesnull > mainClass org.apache.spark.repl.Main > primaryResource spark-shell > nameSpark shell > childArgs [] > jars > file:/C:/Apps/Spark/spark-2.0.2-bin-hadoop2.4/bin/../deeplearning4j-core-0.7.0.jar > packagesnull > packagesExclusions null > repositoriesnull > verbose true > Spark properties used, including those specified through > --conf and those from the properties file null: > Main class: > org.apache.spark.repl.Main > Arguments: > System properties: > SPARK_SUBMIT -> true > spark.app.name -> Spark shell > spark.jars -> > file:/C:/Apps/Spark/spark-2.0.2-bin-hadoop2.4/bin/../deeplearning4j-core-0.7.0.jar > spark.submit.deployMode -> client > spark.master -> local[*] > Classpath elements: > file:/C:/Apps/Spark/spark-2.0.2-bin-hadoop2.4/bin/../deeplearning4j-core-0.7.0.jar > 16/11/30 08:30:49 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/11/30 08:30:51 WARN SparkContext: Use an existing SparkContext, some > configuration may not take effect. > Spark context Web UI available at http://192.168.70.164:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1480512651325). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.0.2 > /_/ > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> import org.deeplearning4j > :23: error: object deeplearning4j is not a member of package org >import org.deeplearning4j > ^ > scala> > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
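The --verbose output above shows the jar resolved to a file:/C:/... URL, and the bug only reproduces on Windows, so one plausible area to inspect is the conversion between a Windows path and a file: URI. The following is an illustrative sketch using Python's standard library, not Spark's actual (Scala) path handling; note that Java commonly prints the single-slash form file:/C:/..., as seen in the log above, while RFC-style file URIs use file:///C:/....

```python
from pathlib import PureWindowsPath

def windows_jar_to_uri(path_str):
    """Illustrative only: convert an absolute Windows jar path to a
    file: URI (forward slashes, drive letter kept), the shape a
    classloader would need to consume."""
    p = PureWindowsPath(path_str)
    if not p.is_absolute():
        raise ValueError("expected an absolute Windows path")
    return p.as_uri()
```

A correct conversion keeps the drive letter and flips backslashes; a classpath entry that keeps raw backslashes or drops the drive is a typical way a jar silently fails to load on Windows.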
[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)
[ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121801#comment-16121801 ] Ruslan Dautkhanov commented on SPARK-21657: --- [~bjornjons] confirms this problem pertains to Spark 2.2 too. > Spark has exponential time complexity to explode(array of structs) > -- > > Key: SPARK-21657 > URL: https://issues.apache.org/jira/browse/SPARK-21657 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0 >Reporter: Ruslan Dautkhanov > Labels: cache, caching, collections, nested_types, performance, > pyspark, sparksql, sql > Attachments: ExponentialTimeGrowth.PNG, > nested-data-generator-and-test.py > > > It can take up to half a day to explode a modest-sizes nested collection > (0.5m). > On a recent Xeon processors. > See attached pyspark script that reproduces this problem. > {code} > cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + > table_name).cache() > print sqlc.count() > {code} > This script generate a number of tables, with the same total number of > records across all nested collection (see `scaling` variable in loops). > `scaling` variable scales up how many nested elements in each record, but by > the same factor scales down number of records in the table. So total number > of records stays the same. > Time grows exponentially (notice log-10 vertical axis scale): > !ExponentialTimeGrowth.PNG! > At scaling 50,000 it took 7 hours to explode the nested collections (\!) of > 8k records. > After 1000 elements in nested collection, time grows exponentially. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21692) Modify PythonUDF to support nullability
Michael Styles created SPARK-21692: -- Summary: Modify PythonUDF to support nullability Key: SPARK-21692 URL: https://issues.apache.org/jira/browse/SPARK-21692 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 2.2.0 Reporter: Michael Styles When creating or registering Python UDFs, a user may know whether null values can be returned by the function. PythonUDF and related classes should be modified to support nullability.
[jira] [Commented] (SPARK-21677) json_tuple throws NullPointException when column is null as string type.
[ https://issues.apache.org/jira/browse/SPARK-21677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121740#comment-16121740 ] Liang-Chi Hsieh commented on SPARK-21677: - As a given field name {{null}} can't be matched with any field names in json, we just output {{null}} as its column value. I think it's reasonable. > json_tuple throws NullPointException when column is null as string type. > > > Key: SPARK-21677 > URL: https://issues.apache.org/jira/browse/SPARK-21677 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > Labels: Starter > > I was testing {{json_tuple}} before using this to extract values from JSONs > in my testing cluster but I found it could actually throw > {{NullPointException}} as below sometimes: > {code} > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Hyukjin'))").show() > +---+ > | c0| > +---+ > |224| > +---+ > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Jackson'))").show() > ++ > | c0| > ++ > |null| > ++ > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show() > ... 
> java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:367) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:366) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames$lzycompute(jsonExpressions.scala:366) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames(jsonExpressions.scala:365) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields$lzycompute(jsonExpressions.scala:373) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields(jsonExpressions.scala:373) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.org$apache$spark$sql$catalyst$expressions$JsonTuple$$parseRow(jsonExpressions.scala:417) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:401) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:400) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2559) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.eval(jsonExpressions.scala:400) > {code} > It sounds we should show explicit error messages or return {{NULL}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
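The stack trace shows the NullPointerException arising while folding the literal field names (foldableFieldNames in jsonExpressions.scala), and the ticket's suggestion is to return NULL for such a column, matching Hive. A minimal Python sketch of that proposed behavior (illustrative only, not Spark's Scala implementation):

```python
import json

def json_tuple(json_str, *field_names):
    """Sketch of the proposed null-safe semantics: a None (SQL NULL)
    field name matches no key, so its column is simply null instead of
    triggering a NullPointerException while folding field names."""
    obj = json.loads(json_str)
    return [obj.get(name) if name is not None else None
            for name in field_names]
```

On the example discussed in this thread, the result lines up with Hive's output (NULL, 2, NULL, 1).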
[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121679#comment-16121679 ] Sean Owen commented on SPARK-21688: --- Not a good solution? How about just checking the env variables? Simple and better than nothing > performance improvement in mllib SVM with native BLAS > -- > > Key: SPARK-21688 > URL: https://issues.apache.org/jira/browse/SPARK-21688 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 > Environment: 4 nodes: 1 master node, 3 worker nodes > model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz > Memory : 180G > num of core per node: 10 >Reporter: Vincent > Attachments: ddot unitest.png, mllib svm training.png, > native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png > > > in current mllib SVM implementation, we found that the CPU is not fully > utilized, one reason is that f2j blas is set to be used in the HingeGradient > computation. As we found out earlier > (https://issues.apache.org/jira/browse/SPARK-21305) that with proper > settings, native blas is generally better than f2j on the uni-test level, > here we make the blas operations in SVM go with MKL blas and get an end to > end performance report showing that in most cases native blas outperformance > f2j blas up to 50%. > So, we suggest removing those f2j-fixed calling and going for native blas if > available. If this proposal is acceptable, we will move on to benchmark other > algorithms impacted. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
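The suggestion in the comment above, checking environment variables before choosing the native path, could look roughly like this. This is a hedged sketch: the variable name SPARK_USE_NATIVE_BLAS is hypothetical, and the real dispatch would live in MLlib's Scala code rather than in Python.

```python
import os

def select_blas_backend(env=None):
    """Choose 'native' BLAS only when a (hypothetical) env flag opts in;
    otherwise fall back to the pure-JVM f2j implementation, preserving
    today's default behavior."""
    env = os.environ if env is None else env
    flag = env.get("SPARK_USE_NATIVE_BLAS", "false").strip().lower()
    return "native" if flag in ("1", "true", "yes") else "f2j"
```

This keeps f2j as the safe default while letting deployments with a tuned MKL/OpenBLAS install opt in explicitly, which is the "simple and better than nothing" check proposed above.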
[jira] [Comment Edited] (SPARK-21677) json_tuple throws NullPointException when column is null as string type.
[ https://issues.apache.org/jira/browse/SPARK-21677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121658#comment-16121658 ] Hyukjin Kwon edited comment on SPARK-21677 at 8/10/17 2:12 PM: --- [~cjm], I was thinking like {code} spark.sql("""SELECT json_tuple('{"a":1, "b":2}', cast(NULL AS STRING), 'b', cast(NULL AS STRING), 'a')""").show() ++---++---+ | c0| c1| c2| c3| ++---++---+ |null| 2|null| 1| ++---++---+ {code} I think this could be at least consistent with Hive's implementation: {code} hive> SELECT json_tuple('{"a":1, "b":2}', cast(NULL AS STRING), 'b', cast(NULL AS STRING), 'a'); ... NULL2 NULL1 {code} was (Author: hyukjin.kwon): [~cjm], I was thinking like {code} spark.sql("""SELECT json_tuple('{"a":1, "b":2}', cast(NULL AS STRING), 'b', cast(NULL AS STRING), 'a');""").show() ++---++---+ | c0| c1| c2| c3| ++---++---+ |null| 2|null| 1| ++---++---+ {code} I think this could be at least consistent with Hive's implementation: {code} hive> SELECT json_tuple('{"a":1, "b":2}', cast(NULL AS STRING), 'b', cast(NULL AS STRING), 'a'); ... NULL2 NULL1 {code} > json_tuple throws NullPointException when column is null as string type. 
> > > Key: SPARK-21677 > URL: https://issues.apache.org/jira/browse/SPARK-21677 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > Labels: Starter > > I was testing {{json_tuple}} before using this to extract values from JSONs > in my testing cluster but I found it could actually throw > {{NullPointException}} as below sometimes: > {code} > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Hyukjin'))").show() > +---+ > | c0| > +---+ > |224| > +---+ > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Jackson'))").show() > ++ > | c0| > ++ > |null| > ++ > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show() > ... > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:367) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:366) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames$lzycompute(jsonExpressions.scala:366) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames(jsonExpressions.scala:365) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields$lzycompute(jsonExpressions.scala:373) > at > 
org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields(jsonExpressions.scala:373) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.org$apache$spark$sql$catalyst$expressions$JsonTuple$$parseRow(jsonExpressions.scala:417) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:401) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:400) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2559) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.eval(jsonExpressions.scala:400) > {code} > It sounds we should show explicit error messages or return {{NULL}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21677) json_tuple throws NullPointException when column is null as string type.
[ https://issues.apache.org/jira/browse/SPARK-21677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121658#comment-16121658 ] Hyukjin Kwon commented on SPARK-21677: -- [~cjm], I was thinking like {code} spark.sql("""SELECT json_tuple('{"a":1, "b":2}', cast(NULL AS STRING), 'b', cast(NULL AS STRING), 'a');""").show() ++---++---+ | c0| c1| c2| c3| ++---++---+ |null| 2|null| 1| ++---++---+ {code} I think this could be at least consistent with Hive's implementation: {code} hive> SELECT json_tuple('{"a":1, "b":2}', cast(NULL AS STRING), 'b', cast(NULL AS STRING), 'a'); ... NULL2 NULL1 {code} > json_tuple throws NullPointException when column is null as string type. > > > Key: SPARK-21677 > URL: https://issues.apache.org/jira/browse/SPARK-21677 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > Labels: Starter > > I was testing {{json_tuple}} before using this to extract values from JSONs > in my testing cluster but I found it could actually throw > {{NullPointException}} as below sometimes: > {code} > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Hyukjin'))").show() > +---+ > | c0| > +---+ > |224| > +---+ > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Jackson'))").show() > ++ > | c0| > ++ > |null| > ++ > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show() > ... 
> java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:367) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:366) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames$lzycompute(jsonExpressions.scala:366) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames(jsonExpressions.scala:365) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields$lzycompute(jsonExpressions.scala:373) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields(jsonExpressions.scala:373) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.org$apache$spark$sql$catalyst$expressions$JsonTuple$$parseRow(jsonExpressions.scala:417) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:401) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:400) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2559) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.eval(jsonExpressions.scala:400) > {code} > It sounds we should show explicit error messages or return {{NULL}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them
[ https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-21656: -- Description: Right now with dynamic allocation, Spark starts by acquiring the number of executors it needs to run all the tasks in parallel (or the configured maximum) for that stage. After it gets that number it will never reacquire more unless an executor dies, is explicitly killed by YARN, or it goes to the next stage. The dynamic allocation manager has the concept of an idle timeout: if a task hasn't been scheduled on an executor for a configurable amount of time (60 seconds by default), that executor is let go. Note that when it lets an executor go due to the idle timeout, it never goes back to see if it should reacquire more. This is a problem for multiple reasons: 1. Unexpected things can happen in the system that cause delays, and Spark should be resilient to these. If the driver is GC'ing, there are network delays, etc., we could idle-timeout executors even though there are tasks to run on them; it's just that the scheduler hasn't had time to start those tasks. Note that in the worst case this allows the number of executors to go to 0 and we have a deadlock. 2. Internal Spark components have opposing requirements. The scheduler tries to get locality; dynamic allocation doesn't know about this, and if it lets the executors go it prevents the scheduler from doing what it was designed to do. For example, the scheduler first tries to schedule node-local, and during this time it can skip scheduling on some executors. After a while, though, the scheduler falls back from node-local to rack-local scheduling, and then eventually to any node. So while the scheduler is doing node-local scheduling, the other executors can idle-timeout. 
This means that when the scheduler does fall back to rack or any locality where it would have used those executors, we have already let them go and it can't schedule all the tasks it could, which can have a huge negative impact on job run time. In both of these cases, when the executors idle-timeout we never go back to check whether we need more executors (until the next stage starts). In the worst case you end up with 0 and deadlock, but generally this shows itself by just going down to very few executors when you could have tens of thousands of tasks to run on them, which causes the job to take far more time (in my case I've seen a job that should take minutes take hours due to only being left a few executors). We should handle these situations in Spark. The most straightforward approach would be to not allow the executors to idle-timeout when there are tasks that could run on those executors. This would allow the scheduler to do its job with locality scheduling. In doing this it also fixes number 1 above, because you can never go into a deadlock as it will keep enough executors to run all the tasks on. There are other approaches to fix this, like explicitly preventing it from going to 0 executors; that prevents a deadlock but can still cause the job to slow down greatly. We could also change it at some point to just re-check whether we should get more executors, but this adds extra logic, we would have to decide when to check, and it's also just overhead in letting them go and then re-acquiring them again; this would cause some slowdown in the job as the executors aren't immediately there for the scheduler to place things on. was: Right now spark lets go of executors when they are idle for the 60s (or configurable time). I have seen spark let them go when they are idle but they were really needed. I have seen this issue when the scheduler was waiting to get node locality but that takes longer than the default idle timeout.
In these jobs the number of executors goes down really small (less than 10) but there are still like 80,000 tasks to run. We should consider not allowing executors to idle-timeout if they are still needed according to the number of tasks to be run. There are multiple reasons: 1. Things can happen in the system that are not expected that can cause delays. Spark should be resilient to these. If the driver is GC'ing, you have network delays, etc., we could idle-timeout executors even though there are tasks to run on them; it's just that the scheduler hasn't had time to start those tasks. These just slow down the user's job, and the user does not want this. 2. Internal Spark components have opposing requirements. The scheduler has a requirement to try to get locality; dynamic allocation doesn't know about this, and by giving away executors it hurts the scheduler's ability to do what it was designed to do. Ideally we have enough executors to run all the tasks on. If dynamic allocation allows those to idle-timeout the scheduler can not make proper
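The guard being proposed here can be sketched roughly as follows. This is an illustrative Scala sketch, not Spark's actual ExecutorAllocationManager code; the `AllocationState` type and all of its field names are assumptions made for the example:

```scala
// Hypothetical sketch of an idle-timeout guard for dynamic allocation.
// All names below are illustrative, not Spark's actual internals.
case class AllocationState(
    pendingTasks: Int,     // tasks not yet scheduled in the current stage
    runningExecutors: Int, // executors currently held
    tasksPerExecutor: Int) // slots per executor (cores / task CPUs)

/** An executor may idle-timeout only if the remaining executors could
  * still run every pending task; otherwise keep it, so the scheduler can
  * later fall back from node-local to rack-local/any placement on it. */
def mayTimeout(state: AllocationState, idleMillis: Long, timeoutMillis: Long): Boolean = {
  val neededExecutors =
    math.ceil(state.pendingTasks.toDouble / state.tasksPerExecutor).toInt
  idleMillis >= timeoutMillis && state.runningExecutors - 1 >= neededExecutors
}
```

With 80,000 pending tasks, a guard of this shape would never let the pool shrink below the executor count needed to run them, which addresses both the locality starvation in point 2 and the deadlock in point 1.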
[jira] [Commented] (SPARK-21677) json_tuple throws NullPointException when column is null as string type.
[ https://issues.apache.org/jira/browse/SPARK-21677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121641#comment-16121641 ] Jen-Ming Chung commented on SPARK-21677: to [~hyukjin.kwon], regarding the return of {{NULL}} you mentioned: does it mean all fields should be null in json_tuple, or just the non-existent field, as shown in the following? Thanks! {code:language=scala|borderStyle=solid} e.g., spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 'not_exising_fields')""").show() +---+---+----+ | c0| c1| c2| +---+---+----+ | 1| 2|null| +---+---+----+ {code} > json_tuple throws NullPointException when column is null as string type. > > > Key: SPARK-21677 > URL: https://issues.apache.org/jira/browse/SPARK-21677 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > Labels: Starter > > I was testing {{json_tuple}} before using this to extract values from JSONs > in my testing cluster but I found it could actually throw > {{NullPointException}} as below sometimes: > {code} > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Hyukjin'))").show() > +---+ > | c0| > +---+ > |224| > +---+ > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Jackson'))").show() > +----+ > | c0| > +----+ > |null| > +----+ > scala> Seq(("""{"Hyukjin": 224, "John": > 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show() > ... 
> java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:367) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:366) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames$lzycompute(jsonExpressions.scala:366) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames(jsonExpressions.scala:365) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields$lzycompute(jsonExpressions.scala:373) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields(jsonExpressions.scala:373) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.org$apache$spark$sql$catalyst$expressions$JsonTuple$$parseRow(jsonExpressions.scala:417) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:401) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:400) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2559) > at > org.apache.spark.sql.catalyst.expressions.JsonTuple.eval(jsonExpressions.scala:400) > {code} > It sounds we should show explicit error messages or return {{NULL}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
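For reference, the {{NullPointerException}} above comes from mapping {{_.toString}} over field-name expressions that can evaluate to null. A null-safe variant of that fold might look like the sketch below; the standalone method signature and the {{fieldExpressions}} parameter name are assumptions for illustration, not the actual {{JsonTuple}} code:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

// Hypothetical null-safe version of the field-name folding that NPEs in the
// stack trace above: a foldable expression evaluating to null becomes None
// (and can later surface as a NULL output column) instead of crashing.
def foldableFieldNames(fieldExpressions: Seq[Expression]): Seq[Option[String]] =
  fieldExpressions.map {
    case expr if expr.foldable => Option(expr.eval()).map(_.toString)
    case _                     => None // non-foldable names are resolved per row
  }
```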
[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121624#comment-16121624 ] Vincent commented on SPARK-21688: - Okay. Yes, true. It can still run without issue, but we are just offering another choice for those who want a 50% or greater speedup from native BLAS in their case; they can also stick to f2j with a simple setting in the Spark configuration. The problem with default thread settings has been discussed in https://issues.apache.org/jira/browse/SPARK-21305. I believe it's non-trivial, but it seems to be a common issue for all native BLAS implementations; there is no good solution to it for now. > performance improvement in mllib SVM with native BLAS > -- > > Key: SPARK-21688 > URL: https://issues.apache.org/jira/browse/SPARK-21688 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 > Environment: 4 nodes: 1 master node, 3 worker nodes > model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz > Memory : 180G > num of core per node: 10 >Reporter: Vincent > Attachments: ddot unitest.png, mllib svm training.png, > native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png > > > In the current MLlib SVM implementation, we found that the CPU is not fully > utilized; one reason is that f2j BLAS is set to be used in the HingeGradient > computation. As we found out earlier > (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings > native BLAS is generally better than f2j at the unit-test level. Here we > make the BLAS operations in SVM go with MKL BLAS and get an end-to-end > performance report showing that in most cases native BLAS outperforms > f2j BLAS by up to 50%. > So, we suggest removing those f2j-fixed calls and going for native BLAS if > available. If this proposal is acceptable, we will move on to benchmark other > algorithms impacted. 
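For readers who want to try the "simple setting" mentioned above: the f2j-vs-native choice in netlib-java is driven by JVM system properties, and native thread pools are typically pinned through environment variables. A configuration sketch (the class names follow netlib-java's conventions, and the values are illustrative, not an official Spark recommendation):

```
# spark-defaults.conf sketch: ask netlib-java for the system's native BLAS
spark.driver.extraJavaOptions    -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS
spark.executor.extraJavaOptions  -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS

# Pin native BLAS thread pools so they do not oversubscribe Spark's task threads
spark.executorEnv.OMP_NUM_THREADS  1
spark.executorEnv.MKL_NUM_THREADS  1
```

Omitting these properties leaves netlib-java on the pure-JVM f2j implementation, which is the fallback behavior Vincent refers to.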
[jira] [Updated] (SPARK-21656) spark dynamic allocation should not idle timeout executors when tasks still to run
[ https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-21656: -- Description: Right now spark lets go of executors when they are idle for the 60s (or configurable time). I have seen spark let them go when they are idle but they were really needed. I have seen this issue when the scheduler was waiting to get node locality but that takes longer then the default idle timeout. In these jobs the number of executors goes down really small (less than 10) but there are still like 80,000 tasks to run. We should consider not allowing executors to idle timeout if they are still needed according to the number of tasks to be run. There are multiple reasons: 1 . Things can happen in the system that are not expected that can cause delays. Spark should be resilient to these. If the driver is GC'ing, you have network delays, etc we could idle timeout executors even though there are tasks to run on them its just the scheduler hasn't had time to start those tasks. these just slow down the users job, the user does not want this. 2. Internal Spark components have opposing requirements. The scheduler has a requirement to try to get locality, the dynamic allocation doesn't know about this and it giving away executors it hurting the scheduler from doing what it was designed to do. Ideally we have enough executors to run all the tasks on. If dynamic allocation allows those to idle timeout the scheduler can not make proper decisions. In the end this hurts users by affects the job. A user should not have to mess with the configs to keep this basic behavior. was: Right now spark lets go of executors when they are idle for the 60s (or configurable time). I have seen spark let them go when they are idle but they were really needed. I have seen this issue when the scheduler was waiting to get node locality but that takes longer then the default idle timeout. 
In these jobs the number of executors goes down really small (less than 10) but there are still like 80,000 tasks to run. We should consider not allowing executors to idle timeout if they are still needed according to the number of tasks to be run. > spark dynamic allocation should not idle timeout executors when tasks still > to run > -- > > Key: SPARK-21656 > URL: https://issues.apache.org/jira/browse/SPARK-21656 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Jong Yoon Lee > Original Estimate: 24h > Remaining Estimate: 24h > > Right now spark lets go of executors when they are idle for the 60s (or > configurable time). I have seen spark let them go when they are idle but they > were really needed. I have seen this issue when the scheduler was waiting to > get node locality but that takes longer then the default idle timeout. In > these jobs the number of executors goes down really small (less than 10) but > there are still like 80,000 tasks to run. > We should consider not allowing executors to idle timeout if they are still > needed according to the number of tasks to be run. > There are multiple reasons: > 1 . Things can happen in the system that are not expected that can cause > delays. Spark should be resilient to these. If the driver is GC'ing, you have > network delays, etc we could idle timeout executors even though there are > tasks to run on them its just the scheduler hasn't had time to start those > tasks. these just slow down the users job, the user does not want this. > 2. Internal Spark components have opposing requirements. The scheduler has a > requirement to try to get locality, the dynamic allocation doesn't know about > this and it giving away executors it hurting the scheduler from doing what it > was designed to do. > Ideally we have enough executors to run all the tasks on. If dynamic > allocation allows those to idle timeout the scheduler can not make proper > decisions. 
In the end this hurts users by affecting the job. A user should not have to mess with the configs to keep this basic behavior.
[jira] [Updated] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them
[ https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-21656: -- Summary: spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them (was: spark dynamic allocation should not idle timeout executors when tasks still to run) > spark dynamic allocation should not idle timeout executors when there are > enough tasks to run on them > - > > Key: SPARK-21656 > URL: https://issues.apache.org/jira/browse/SPARK-21656 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Jong Yoon Lee > Original Estimate: 24h > Remaining Estimate: 24h > > Right now spark lets go of executors when they are idle for the 60s (or > configurable time). I have seen spark let them go when they are idle but they > were really needed. I have seen this issue when the scheduler was waiting to > get node locality but that takes longer then the default idle timeout. In > these jobs the number of executors goes down really small (less than 10) but > there are still like 80,000 tasks to run. > We should consider not allowing executors to idle timeout if they are still > needed according to the number of tasks to be run. > There are multiple reasons: > 1 . Things can happen in the system that are not expected that can cause > delays. Spark should be resilient to these. If the driver is GC'ing, you have > network delays, etc we could idle timeout executors even though there are > tasks to run on them its just the scheduler hasn't had time to start those > tasks. these just slow down the users job, the user does not want this. > 2. Internal Spark components have opposing requirements. The scheduler has a > requirement to try to get locality, the dynamic allocation doesn't know about > this and it giving away executors it hurting the scheduler from doing what it > was designed to do. 
> Ideally we have enough executors to run all the tasks on. If dynamic > allocation allows those to idle-timeout, the scheduler cannot make proper > decisions. In the end this hurts users by affecting the job. A user should not > have to mess with the configs to keep this basic behavior.
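Until the allocation logic itself changes, the symptom described in this issue can be partially mitigated from configuration alone, by widening the idle window and shortening the locality wait. The keys below are real Spark settings; the values are illustrative only:

```
# spark-defaults.conf sketch (illustrative values)
spark.dynamicAllocation.enabled              true
# Default is 60s; raising it keeps executors alive while the scheduler
# is still holding out for node-local placement
spark.dynamicAllocation.executorIdleTimeout  300s
# Default is 3s per locality level; lowering it makes the scheduler fall
# back to rack-local/any sooner, before executors idle out
spark.locality.wait                          1s
```

This is a workaround, not a fix: it trades locality quality and executor hoarding against the starvation/deadlock described above.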
[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121599#comment-16121599 ] Sean Owen commented on SPARK-21688: --- I mean best case in that MKL might be a little different from OpenBLAS but this is minor. I suppose this won't impact users with no acceleration. What is the slowdown for those who use default thread settings? Because that's the most common scenario. If it's non trivial we can't just ignore it. > performance improvement in mllib SVM with native BLAS > -- > > Key: SPARK-21688 > URL: https://issues.apache.org/jira/browse/SPARK-21688 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 > Environment: 4 nodes: 1 master node, 3 worker nodes > model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz > Memory : 180G > num of core per node: 10 >Reporter: Vincent > Attachments: ddot unitest.png, mllib svm training.png, > native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png > > > in current mllib SVM implementation, we found that the CPU is not fully > utilized, one reason is that f2j blas is set to be used in the HingeGradient > computation. As we found out earlier > (https://issues.apache.org/jira/browse/SPARK-21305) that with proper > settings, native blas is generally better than f2j on the uni-test level, > here we make the blas operations in SVM go with MKL blas and get an end to > end performance report showing that in most cases native blas outperformance > f2j blas up to 50%. > So, we suggest removing those f2j-fixed calling and going for native blas if > available. If this proposal is acceptable, we will move on to benchmark other > algorithms impacted. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121589#comment-16121589 ] Vincent commented on SPARK-21688: - [~srowen] Thanks for your comments. I think if a user decides to use native BLAS, they should be aware of the threading configuration impacts; checking this env variable in MLlib doesn't make sense. And no, we didn't just present the best-case result; we took the average of three runs for each case. The results show that for small datasets native BLAS might not have an advantage over f2j, but the gap is small, and we would expect big-data processing to be the more common case here. > performance improvement in mllib SVM with native BLAS > -- > > Key: SPARK-21688 > URL: https://issues.apache.org/jira/browse/SPARK-21688 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 > Environment: 4 nodes: 1 master node, 3 worker nodes > model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz > Memory : 180G > num of core per node: 10 >Reporter: Vincent > Attachments: ddot unitest.png, mllib svm training.png, > native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png > > > In the current MLlib SVM implementation, we found that the CPU is not fully > utilized; one reason is that f2j BLAS is set to be used in the HingeGradient > computation. As we found out earlier > (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings > native BLAS is generally better than f2j at the unit-test level. Here we > make the BLAS operations in SVM go with MKL BLAS and get an end-to-end > performance report showing that in most cases native BLAS outperforms > f2j BLAS by up to 50%. > So, we suggest removing those f2j-fixed calls and going for native BLAS if > available. If this proposal is acceptable, we will move on to benchmark other > algorithms impacted. 
[jira] [Created] (SPARK-21691) Accessing canonicalized plan for query with limit throws exception
Bjoern Toldbod created SPARK-21691: -- Summary: Accessing canonicalized plan for query with limit throws exception Key: SPARK-21691 URL: https://issues.apache.org/jira/browse/SPARK-21691 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Bjoern Toldbod Accessing the logical, canonicalized plan fails for queries with limits. The following demonstrates the issue: {code:java} val session = SparkSession.builder.master("local").getOrCreate() // This works session.sql("select * from (values 0, 1)").queryExecution.logical.canonicalized // This fails session.sql("select * from (values 0, 1) limit 1").queryExecution.logical.canonicalized {code} The message in the thrown exception is somewhat confusing (or at least not directly related to the limit): "Invalid call to toAttribute on unresolved object, tree: *"
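A possible workaround, assuming the caller only needs a canonical plan for comparison purposes, is to canonicalize the analyzed plan rather than the unresolved logical one, since the unresolved {{*}} is what triggers the {{toAttribute}} error:

```scala
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder.master("local").getOrCreate()

// Workaround sketch: canonicalize after analysis, once '*' has been
// resolved into concrete output attributes. This assumes the analyzed
// plan is an acceptable substitute for the unresolved logical plan.
val canonical = session.sql("select * from (values 0, 1) limit 1")
  .queryExecution.analyzed.canonicalized
```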
[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList
[ https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121401#comment-16121401 ] Paul Praet commented on SPARK-21402: It seems changing the order of the fields in the struct can give some improvements, but when I add more fields the problem just gets worse - some fields never get filled in at all, or get filled in twice. > Java encoders - switch fields on collectAsList > -- > > Key: SPARK-21402 > URL: https://issues.apache.org/jira/browse/SPARK-21402 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: mac os > spark 2.1.1 > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121 >Reporter: Tom > > I have the following schema in a dataset - > root > |-- userId: string (nullable = true) > |-- data: map (nullable = true) > ||-- key: string > ||-- value: struct (valueContainsNull = true) > |||-- startTime: long (nullable = true) > |||-- endTime: long (nullable = true) > |-- offset: long (nullable = true) > And I have the following classes (+ setters and getters, which I omitted for > simplicity) - > > {code:java} > public class MyClass { > private String userId; > private Map<String, MyDTO> data; > private Long offset; > } > public class MyDTO { > private long startTime; > private long endTime; > } > {code} > I collect the result the following way - > {code:java} > Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class); > Dataset<MyClass> results = raw_df.as(myClassEncoder); > List<MyClass> lst = results.collectAsList(); > {code} > > I do several calculations to get the result I want and the result is correct > all through the way before I collect it. 
> This is the result for - > {code:java} > results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false); > {code} > +--------------------------+------------------------+ > |data[2017-07-01].startTime|data[2017-07-01].endTime| > +--------------------------+------------------------+ > |1498854000 |1498870800 | > +--------------------------+------------------------+ > This is the result after collecting the results for - > {code:java} > MyClass userData = results.collectAsList().get(0); > MyDTO userDTO = userData.getData().get("2017-07-01"); > System.out.println("userDTO startTime: " + userDTO.getStartTime()); > System.out.println("userDTO endTime: " + userDTO.getEndTime()); > {code} > -- > data startTime: 1498870800 > data endTime: 1498854000 > I tend to believe it is a Spark issue. Would love any suggestions on how to > bypass it.
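Until the encoder bug is fixed, one defensive pattern is to avoid the bean encoder for the nested struct entirely and read the fields by name, which cannot be affected by positional mis-binding. A sketch against the schema above ({{raw_df}} and the date key are taken from the report; the rest is illustrative):

```scala
// Workaround sketch: by-name access instead of bean-encoder binding.
// Map elements are read with SQL bracket syntax; getAs uses field names,
// so swapped positions in the struct cannot swap the values.
val row = raw_df.selectExpr(
    "userId",
    "data['2017-07-01'].startTime as startTime",
    "data['2017-07-01'].endTime as endTime")
  .first()

val startTime = row.getAs[Long]("startTime") // names, not positions
val endTime   = row.getAs[Long]("endTime")
```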
[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects and filters in optimizer
[ https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-21520: -- Description: Currently the optimizer does a lot of special handling for non-deterministic projects and filters, but it is not good enough. This patch adds a new special case for non-deterministic projects and filters, so that we only read the fields the user actually needs. For example, when the fields of a project contain a non-deterministic function (the rand function), the optimizer generates an executedPlan like: *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L]) +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L] +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table HiveTableScan will read all the fields from the table, but we only need 'd004'. This affects task performance. was: Currently the optimizer does a lot of special handling for non-deterministic projects and filters, but it is not good enough. This patch adds a new special case for non-deterministic projects and filters, so that we only read the fields the user actually needs. > Improvement a special case for non-deterministic projects and filters in > optimizer > -- > > Key: SPARK-21520 > URL: https://issues.apache.org/jira/browse/SPARK-21520 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: caoxuewen > > Currently the optimizer does a lot of special handling for non-deterministic projects and > filters, but it is not good enough. 
This patch adds a new special case > for non-deterministic projects and filters, so that we only read the fields > the user actually needs. > For example, when the fields of a project contain a non-deterministic function (the rand > function), the optimizer generates an executedPlan like: > *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as > bigint))], output=[k#403L, sum#800L]) > +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) > AS k#403L] >+- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, > d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, > d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, > c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], > MetastoreRelation XXX_database, XXX_table > HiveTableScan will read all the fields from the table, but we only need > 'd004'. This affects task performance.
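The improvement described above amounts to column pruning through a non-deterministic Project. Until the optimizer does this automatically, the same effect can be achieved by hand, projecting the needed column before applying the non-deterministic expression. A sketch using the table and column names from the plan fragment (the `spark` session is assumed):

```scala
import org.apache.spark.sql.functions.{col, floor, rand}

// Manual pruning sketch: select only d004 *before* adding the rand()-based
// column, so the HiveTableScan reads one field instead of 169+.
val pruned = spark.table("XXX_database.XXX_table")
  .select(col("d004").as("id"))
  .withColumn("k", floor(rand() * 1.0))
```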
[jira] [Issue Comment Deleted] (SPARK-20971) Purge the metadata log for FileStreamSource
[ https://issues.apache.org/jira/browse/SPARK-20971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Shao updated SPARK-20971: - Comment: was deleted (was: Hi Zhu, What does metadata logs stands for please?) > Purge the metadata log for FileStreamSource > --- > > Key: SPARK-20971 > URL: https://issues.apache.org/jira/browse/SPARK-20971 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Shixiong Zhu > > Currently > [FileStreamSource.commit|https://github.com/apache/spark/blob/16186cdcbce1a2ec8f839c550e6b571bf5dc2692/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L258] > is empty. We can delete unused metadata logs in this method to reduce the > size of log files.
[jira] [Comment Edited] (SPARK-21402) Java encoders - switch fields on collectAsList
[ https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121297#comment-16121297 ] Paul Praet edited comment on SPARK-21402 at 8/10/17 9:04 AM: - I can confirm this problem persists in Spark 2.2.0: fields get all swapped when you use the bean encoder on a dataset with an array of structs. A plain struct works, an array of structs does not. Pretty big issue if you ask me. {noformat} root |-- writeKey: string (nullable = false) |-- id: string (nullable = false) |-- type: string (nullable = false) |-- ssid: string (nullable = false) ++--+++ |writeKey|id|type|ssid| ++--+++ |someWriteKey|someId|someType|someSSID| ++--+++ {noformat} When I convert into a struct, everything is still fine: {noformat} root |-- writeKey: string (nullable = false) |-- nodes: struct (nullable = false) ||-- id: string (nullable = false) ||-- type: string (nullable = false) ||-- ssid: string (nullable = false) ++--+ |writeKey|nodes | ++--+ |someWriteKey|[someId,someType,someSSID]| ++--+ {noformat} When I do a groupBy on writeKey and a collect_set() on the nodes, we get: {noformat} root |-- writeKey: string (nullable = false) |-- nodes: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: string (nullable = false) |||-- type: string (nullable = false) |||-- ssid: string (nullable = false) +++ |writeKey|nodes | +++ |someWriteKey|[[someId,someType,someSSID]]| +++ {noformat} When I convert this to Java... {code:java} Dataset dfArray = dfStruct.groupBy("writeKey") .agg(functions.collect_set("nodes").alias("nodes")); Encoder topologyEncoder = Encoders.bean(Topology.class); Dataset datasetMultiple = dfArray.as(topologyEncoder); System.out.println(datasetMultiple.first()); {code} This prints: {noformat} Topology{writeKey='someWriteKey', nodes=[Node{id='someId', type='someSSID', ssid='someType'}]} {noformat} You can clearly see the type and ssid fields were swapped. 
POJO classes: {code:java} public static class Topology { private String writeKey; private List nodes; public Topology() { } public String getWriteKey() { return writeKey; } public void setWriteKey(String writeKey) { this.writeKey = writeKey; } public List getNodes() { return nodes; } public void setNodes(List nodes) { this.nodes = nodes; } @Override public String toString() { return "Topology{" + "writeKey='" + writeKey + '\'' + ", nodes=" + nodes + '}'; } } public static class Node { private String id; private String type; private String ssid; public Node() { } public String getId() { return id; } public void setId(String id) { this.id = id; } public String getType() { return type; } public void setType(String type) { this.type = type; } public String getSsid() { return ssid; } public void setSsid(String ssid) { this.ssid = ssid; } @Override public String toString() { return "Node{" + "id='" + id + '\'' + ", type='" + type + '\'' + ", ssid='" + ssid + '\'' + '}'; } } {code} was (Author: praetp): I can confirm this problem persists in Spark 2.2.0: fields get all swapped when you use the bean encoder on a dataset with an array of structs. A plain struct works, an array of structs does not. Pretty big issue if you ask me. {noformat} root |-- writeKey: string (nullable = false) |-- id: string (nullable = false) |-- type: string (nullable = false) |-- ssid: string (nullable = false) ++--+++ |writeKey|id|type|ssid| ++--+++ |someWriteKey|someId|someType|someSSID| ++--+++ {noformat} When I convert into a struct, everything is still fine: {noformat} root |-- writeKey: string (nullable = false) |-- nodes: struct (nullable = false) ||-- id: string (nullable = false) ||-- type: string (nullable = false) ||-- ssid: string (nullable = false)
[jira] [Comment Edited] (SPARK-21402) Java encoders - switch fields on collectAsList
[ https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121297#comment-16121297 ] Paul Praet edited comment on SPARK-21402 at 8/10/17 9:03 AM: - I can confirm this problem persists in Spark 2.2.0: fields get all swapped when you use the bean encoder on a dataset with an array of structs. A plain struct works, an array of structs does not. Pretty big issue if you ask me. {noformat} root |-- writeKey: string (nullable = false) |-- id: string (nullable = false) |-- type: string (nullable = false) |-- ssid: string (nullable = false) ++--+++ |writeKey|id|type|ssid| ++--+++ |someWriteKey|someId|someType|someSSID| ++--+++ {noformat} When I convert into a struct, everything is still fine: {noformat} root |-- writeKey: string (nullable = false) |-- nodes: struct (nullable = false) ||-- id: string (nullable = false) ||-- type: string (nullable = false) ||-- ssid: string (nullable = false) ++--+ |writeKey|nodes | ++--+ |someWriteKey|[someId,someType,someSSID]| ++--+ {noformat} When I do a groupBy on writeKey and a collect_set() on the nodes, we get: {noformat} root |-- writeKey: string (nullable = false) |-- nodes: array (nullable = true) ||-- element: struct (containsNull = true) |||-- id: string (nullable = false) |||-- type: string (nullable = false) |||-- ssid: string (nullable = false) +++ |writeKey|nodes | +++ |someWriteKey|[[someId,someType,someSSID]]| +++ {noformat} When I convert this to Java... {code:java} Dataset dfArray = dfStruct.groupBy("writeKey") .agg(functions.collect_set("nodes").alias("nodes")); Encoder topologyEncoder = Encoders.bean(Topology.class); Dataset datasetMultiple = dfArray.as(topologyEncoder); System.out.println(datasetMultiple.first()); {code} This prints: {noformat} Topology{writeKey='someWriteKey', nodes=[Node{id='someId', type='someSSID', ssid='someType'}]} {noformat} You can clearly see the type and ssid fields were swapped. 
POJO classes:
{code:java}
public static class Topology {
    private String writeKey;
    private List<Node> nodes;

    public Topology() {
    }

    public String getWriteKey() {
        return writeKey;
    }

    public void setWriteKey(String writeKey) {
        this.writeKey = writeKey;
    }

    public List<Node> getNodes() {
        return nodes;
    }

    public void setNodes(List<Node> nodes) {
        this.nodes = nodes;
    }

    @Override
    public String toString() {
        return "Topology{" + "writeKey='" + writeKey + '\'' + ", nodes=" + nodes + '}';
    }
}

public static class Node {
    private String id;
    private String type;
    private String ssid;

    public Node() {
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getType() {
        return type;
    }

    public void setType(String type) {
        this.type = type;
    }

    public String getSsid() {
        return ssid;
    }

    public void setSsid(String ssid) {
        this.ssid = ssid;
    }

    @Override
    public String toString() {
        return "Node{" + "id='" + id + '\'' + ", type='" + type + '\'' + ", ssid='" + ssid + '\'' + '}';
    }
}
{code}
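The swap pattern above (type and ssid exchanged, id untouched) is consistent with JavaBean property introspection returning properties in alphabetical order rather than declaration order. This is a standalone illustration of that ordering, not the actual Spark encoder code:

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.Arrays;

public class BeanOrder {
    public static class Node {
        private String id;
        private String type;
        private String ssid;
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public String getType() { return type; }
        public void setType(String type) { this.type = type; }
        public String getSsid() { return ssid; }
        public void setSsid(String ssid) { this.ssid = ssid; }
    }

    public static void main(String[] args) throws IntrospectionException {
        // JavaBean introspection reports properties alphabetically, not in
        // declaration order, so Node(id, type, ssid) comes back as
        // [id, ssid, type]. Binding those positionally against a struct
        // declared as (id, type, ssid) would swap type and ssid.
        PropertyDescriptor[] props =
            Introspector.getBeanInfo(Node.class, Object.class).getPropertyDescriptors();
        String[] names = Arrays.stream(props)
            .map(PropertyDescriptor::getName)
            .toArray(String[]::new);
        System.out.println(Arrays.toString(names)); // [id, ssid, type]
    }
}
```

If the encoder maps struct columns to bean properties by position rather than by name somewhere along this path, exactly the observed swap falls out.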
[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121299#comment-16121299 ] Sean Owen commented on SPARK-21688:
---
Understood, though it potentially impacts the benchmarks. You have a best-case result here.
> performance improvement in mllib SVM with native BLAS
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of cores per node: 10
> Reporter: Vincent
> Attachments: ddot unitest.png, mllib svm training.png, native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
> In the current MLlib SVM implementation, we found that the CPU is not fully utilized; one reason is that f2j BLAS is set to be used in the HingeGradient computation. As we found out earlier (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings native BLAS is generally better than f2j at the unit-test level. Here we make the BLAS operations in SVM go with MKL BLAS and get an end-to-end performance report showing that in most cases native BLAS outperforms f2j BLAS by up to 50%.
> So, we suggest removing those f2j-fixed calls and going for native BLAS if available. If this proposal is acceptable, we will move on to benchmark the other algorithms impacted.
--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList
[ https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121297#comment-16121297 ] Paul Praet commented on SPARK-21402:
-
I can confirm this problem persists in Spark 2.2.0: fields get swapped when you use the bean encoder on a dataset with an array of structs. A plain struct works; an array of structs does not. Pretty big issue if you ask me. I have a data model like this (all Strings):
{noformat}
+------------+------+--------+--------+
|writeKey    |id    |type    |ssid    |
+------------+------+--------+--------+
|someWriteKey|someId|someType|someSSID|
+------------+------+--------+--------+
{noformat}
When I convert it into a struct, everything is still fine:
{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: struct (nullable = false)
 |    |-- id: string (nullable = false)
 |    |-- type: string (nullable = false)
 |    |-- ssid: string (nullable = false)

+------------+--------------------------+
|writeKey    |nodes                     |
+------------+--------------------------+
|someWriteKey|[someId,someType,someSSID]|
+------------+--------------------------+
{noformat}
When I do a groupBy on writeKey and a collect_set() on the nodes, we get:
{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = false)
 |    |    |-- type: string (nullable = false)
 |    |    |-- ssid: string (nullable = false)

+------------+----------------------------+
|writeKey    |nodes                       |
+------------+----------------------------+
|someWriteKey|[[someId,someType,someSSID]]|
+------------+----------------------------+
{noformat}
When I convert this to Java...
{code:java}
Dataset<Row> dfArray = dfStruct.groupBy("writeKey")
    .agg(functions.collect_set("nodes").alias("nodes"));
Encoder<Topology> topologyEncoder = Encoders.bean(Topology.class);
Dataset<Topology> datasetMultiple = dfArray.as(topologyEncoder);
System.out.println(datasetMultiple.first());
{code}
This prints:
{noformat}
Topology{writeKey='someWriteKey', nodes=[Node{id='someId', type='someSSID', ssid='someType'}]}
{noformat}
You can clearly see the type and ssid fields were swapped.
POJO classes:
{code:java}
public static class Topology {
    private String writeKey;
    private List<Node> nodes;

    public Topology() {
    }

    public String getWriteKey() {
        return writeKey;
    }

    public void setWriteKey(String writeKey) {
        this.writeKey = writeKey;
    }

    public List<Node> getNodes() {
        return nodes;
    }

    public void setNodes(List<Node> nodes) {
        this.nodes = nodes;
    }

    @Override
    public String toString() {
        return "Topology{" + "writeKey='" + writeKey + '\'' + ", nodes=" + nodes + '}';
    }
}

public static class Node {
    private String id;
    private String type;
    private String ssid;

    public Node() {
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getType() {
        return type;
    }

    public void setType(String type) {
        this.type = type;
    }

    public String getSsid() {
        return ssid;
    }

    public void setSsid(String ssid) {
        this.ssid = ssid;
    }

    @Override
    public String toString() {
        return "Node{" + "id='" + id + '\'' + ", type='" + type + '\'' + ", ssid='" + ssid + '\'' + '}';
    }
}
{code}
> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
> Reporter: Tom
>
> I have the following schema in a dataset -
> root
> |-- userId: string (nullable = true)
> |-- data: map (nullable = true)
> |    |-- key: string
> |    |-- value: struct (valueContainsNull = true)
> |    |    |-- startTime: long (nullable = true)
> |    |    |-- endTime: long (nullable = true)
> |-- offset: long (nullable = true)
> And I have the following classes (+ setters and getters, which I omitted for simplicity) -
>
> {code:java}
> public class MyClass {
>     private String userId;
>     private Map data;
>     private Long offset;
> }
>
[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121284#comment-16121284 ] Peng Meng commented on SPARK-21688:
---
MKL is just one example of a native BLAS; if the user has OpenBLAS, ATLAS, and so on, that works too.
[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121270#comment-16121270 ] Sean Owen commented on SPARK-21688:
---
I guess my concern is that this slows things down unless people apply the threading configuration described in the docs. I wonder if it's possible to check whether the threading env variable is set correctly and choose native BLAS only if so, for these ops?
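Sean's suggestion could look roughly like the sketch below. This is a hypothetical guard, not Spark code: the class and method names are invented, and it only checks the well-known threading variables (MKL_NUM_THREADS, OPENBLAS_NUM_THREADS, OMP_NUM_THREADS), choosing native BLAS only when one of them pins the native library to a single thread:

```java
import java.util.function.Function;

public class BlasGuard {
    // Environment variables commonly used to pin native BLAS to one thread.
    private static final String[] THREAD_VARS = {
        "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS"
    };

    // Returns true only if at least one threading variable is set to "1";
    // a caller would fall back to f2j otherwise. The env lookup is injected
    // so the logic is testable without touching the real environment.
    static boolean nativeBlasSafe(Function<String, String> env) {
        for (String var : THREAD_VARS) {
            String v = env.apply(var);
            if (v != null && v.trim().equals("1")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        boolean useNative = nativeBlasSafe(System::getenv);
        System.out.println(useNative ? "native BLAS" : "f2j fallback");
    }
}
```

A real check would likely also need to consult the loaded BLAS implementation, since an unset variable usually means "use all cores" rather than "single-threaded".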
[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121254#comment-16121254 ] Vincent commented on SPARK-21688:
-
And if native BLAS is left with its default multi-threading setting, it can impact other ops on the JVM, as shown in native-trywait.png in the attached files.
[jira] [Updated] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vincent updated SPARK-21688:
Attachment: native-trywait.png
[jira] [Updated] (SPARK-21684) df.write double escaping all the already escaped characters except the first one
[ https://issues.apache.org/jira/browse/SPARK-21684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Taran Saini updated SPARK-21684:
Attachment: SparkQuotesTest2.scala
PFA the same.
> df.write double escaping all the already escaped characters except the first one
> --
>
> Key: SPARK-21684
> URL: https://issues.apache.org/jira/browse/SPARK-21684
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Taran Saini
> Attachments: SparkQuotesTest2.scala
>
> Hi,
> If we have a dataframe with the column value {noformat} ab\,cd\,ef\,gh {noformat}
> then while writing, it is written as {noformat} "ab\,cd\\,ef\\,gh" {noformat}
> i.e. it double-escapes all the already-escaped commas/delimiters except the first one.
> This is odd behaviour: it should do so either for all of them or for none.
> If I set df.option("escape", "") to empty, that solves this problem, but then any double quotes inside a value are preceded by a special char, i.e. '\u00'. Why does it do so when the escape character is set to "" (empty)?
[jira] [Commented] (SPARK-21684) df.write double escaping all the already escaped characters except the first one
[ https://issues.apache.org/jira/browse/SPARK-21684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121244#comment-16121244 ] Liang-Chi Hsieh commented on SPARK-21684:
-
Would you mind providing a small piece of code to reproduce it? Thanks.
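This is not the actual CSV writer code, but the general failure mode Taran describes can be illustrated in a few lines: if a writer re-escapes a value whose delimiters were already escaped, every existing backslash gets doubled (the class and method here are invented for illustration):

```java
public class DoubleEscape {
    // Naive writer-side escaping: escape the escape character first, then
    // the delimiter. Feeding it an already-escaped value doubles each
    // existing backslash.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace(",", "\\,");
    }

    public static void main(String[] args) {
        String alreadyEscaped = "ab\\,cd\\,ef\\,gh"; // user already escaped the commas
        System.out.println(escape(alreadyEscaped));
        // prints ab\\\,cd\\\,ef\\\,gh - every backslash doubled, whereas the
        // reported output oddly leaves the first escape alone
    }
}
```

The reported behaviour (first escape untouched, the rest doubled) suggests the real writer applies this kind of re-escaping inconsistently across the value, which is what makes it look like a bug rather than a quoting convention.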
[jira] [Updated] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vincent updated SPARK-21688:
Attachment: (was: uni-test on ddot.png)
[jira] [Updated] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vincent updated SPARK-21688:
Attachment: ddot unitest.png
[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121236#comment-16121236 ] Vincent commented on SPARK-21688:
-
Uploaded data we collected before: a unit test on ddot. We can see that for data sizes greater than 100, native BLAS normally has the advantage, but if the size is smaller than 100, f2j is the better choice.
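Vincent's numbers suggest a size-based dispatch. Below is a hypothetical sketch, not Spark code: the threshold of 100 is taken from the ddot measurements above, a plain JVM loop stands in for the f2j path, and the native BLAS binding is injected as a function since the JNI call itself isn't reproduced here:

```java
import java.util.function.BiFunction;

public class DdotDispatch {
    // Crossover point suggested by the ddot unit test above (assumed).
    static final int NATIVE_THRESHOLD = 100;

    // Plain JVM loop, standing in for the f2j implementation.
    static double ddotLoop(double[] x, double[] y) {
        double s = 0.0;
        for (int i = 0; i < x.length; i++) {
            s += x[i] * y[i];
        }
        return s;
    }

    // Dispatch: small vectors stay on the JVM (call overhead dominates),
    // large vectors go to the supplied native BLAS binding.
    static double ddot(double[] x, double[] y,
                       BiFunction<double[], double[], Double> nativeDdot) {
        if (x.length < NATIVE_THRESHOLD) {
            return ddotLoop(x, y);
        }
        return nativeDdot.apply(x, y);
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {4.0, 5.0, 6.0};
        // Below the threshold, the loop is used and the stub is never called.
        System.out.println(ddot(x, y, (a, b) -> {
            throw new AssertionError("native path not expected for small input");
        })); // 32.0
    }
}
```

This kind of per-call dispatch would sidestep Sean's concern for small inputs while still capturing the native-BLAS win on large vectors.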
[jira] [Updated] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vincent updated SPARK-21688:
Attachment: uni-test on ddot.png
[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121225#comment-16121225 ] Sean Owen commented on SPARK-21688:
---
I see, so you're saying use native BLAS for level-1 ops. Do we know, however, that user environments will have the right threading config such that this is a performance win? How big is it? Your benchmark shows gains only on huge inputs, and keep in mind most people won't have MKL. What about inputs of size 10 or 100?
[jira] [Updated] (SPARK-21679) KMeans Clustering is Not Deterministic
[ https://issues.apache.org/jira/browse/SPARK-21679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21679:
--
Priority: Minor (was: Major)
As a general statement, it's hard to get deterministic behavior out of a distributed implementation. The order in which tasks execute can sometimes matter, and it's not possible to control every RNG used by every library. It might be possible in this particular case.
> KMeans Clustering is Not Deterministic
> --
>
> Key: SPARK-21679
> URL: https://issues.apache.org/jira/browse/SPARK-21679
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.1.0, 2.2.0
> Reporter: Christoph Brücke
> Priority: Minor
>
> I’m trying to figure out how to use KMeans in order to achieve reproducible results. I have found that running the same KMeans instance on the same data with different partitioning will produce different clusterings.
> A simple KMeans run with a fixed seed returns different results on the same training data if the training data is partitioned differently.
> Consider the following example. The same KMeans clustering setup is run on identical data. The only difference is the partitioning of the training data (one partition vs. four partitions).
> {noformat}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.ml.clustering.KMeans
> import org.apache.spark.ml.feature.VectorAssembler
> // generate random data for clustering
> val randomData = spark.range(1, 1000).withColumn("a", rand(123)).withColumn("b", rand(321))
> val vecAssembler = new VectorAssembler().setInputCols(Array("a", "b")).setOutputCol("features")
> val data = vecAssembler.transform(randomData)
> // instantiate KMeans with fixed seed
> val kmeans = new KMeans().setK(10).setSeed(9876L)
> // train the model with different partitioning
> val dataWith1Partition = data.repartition(1)
> println("1 Partition: " + kmeans.fit(dataWith1Partition).computeCost(dataWith1Partition))
> val dataWith4Partition = data.repartition(4)
> println("4 Partition: " + kmeans.fit(dataWith4Partition).computeCost(dataWith4Partition))
> {noformat}
> I get the following respective costs:
> {noformat}
> 1 Partition: 16.028212597888057
> 4 Partition: 16.14758460544976
> {noformat}
> What I want to achieve is that repeated computations of the KMeans clustering yield identical results on identical training data, regardless of the partitioning.
> Looking through the Spark source code, I guess the cause is the initialization method of KMeans, which in turn uses the `takeSample` method, which does not seem to be partition-agnostic.
> Is this behaviour expected? Is there anything I could do to achieve reproducible results?
[jira] [Updated] (SPARK-21679) KMeans Clustering is Not Deterministic
[ https://issues.apache.org/jira/browse/SPARK-21679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-21679:
Issue Type: Improvement (was: Bug)
[jira] [Commented] (SPARK-21679) KMeans Clustering is Not Deterministic
[ https://issues.apache.org/jira/browse/SPARK-21679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121213#comment-16121213 ] Liang-Chi Hsieh commented on SPARK-21679: - Old MLlib {{org.apache.spark.mllib.clustering.KMeans}} provides a feature to set the initial starting points by {{setInitialModel}} method. I think it can provide the deterministic clustering results you need. Looks like currently {{ml.clustering.KMeans}} doesn't provide similar feature yet. > KMeans Clustering is Not Deterministic > -- > > Key: SPARK-21679 > URL: https://issues.apache.org/jira/browse/SPARK-21679 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0, 2.2.0 >Reporter: Christoph Brücke > > I’m trying to figure out how to use KMeans in order to achieve reproducible > results. I have found that running the same kmeans instance on the same data, > with different partitioning will produce different clusterings. > Given a simple KMeans run with fixed seed returns different results on the > same > training data, if the training data is partitioned differently. > Consider the following example. The same KMeans clustering set up is run on > identical data. The only difference is the partitioning of the training data > (one partition vs. four partitions). 
> {noformat} > import org.apache.spark.sql.DataFrame > import org.apache.spark.ml.clustering.KMeans > import org.apache.spark.ml.feature.VectorAssembler > // generate random data for clustering > val randomData = spark.range(1, 1000).withColumn("a", > rand(123)).withColumn("b", rand(321)) > val vecAssembler = new VectorAssembler().setInputCols(Array("a", > "b")).setOutputCol("features") > val data = vecAssembler.transform(randomData) > // instantiate KMeans with fixed seed > val kmeans = new KMeans().setK(10).setSeed(9876L) > // train the model with different partitioning > val dataWith1Partition = data.repartition(1) > println("1 Partition: " + > kmeans.fit(dataWith1Partition).computeCost(dataWith1Partition)) > val dataWith4Partition = data.repartition(4) > println("4 Partition: " + > kmeans.fit(dataWith4Partition).computeCost(dataWith4Partition)) > {noformat} > I get the following related cost > {noformat} > 1 Partition: 16.028212597888057 > 4 Partition: 16.14758460544976 > {noformat} > What I want to achieve is that repeated computations of the KMeans Clustering > should yield identical result on identical training data, regardless of the > partitioning. > Looking through the Spark source code, I guess the cause is the > initialization > method of KMeans which in turn uses the `takeSample` method, which does not > seem to be partition agnostic. > Is this behaviour expected? Is there anything I could do to achieve > reproducible results? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
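The partition sensitivity described above can be pictured with a toy model. This is a plain-Python sketch, not Spark's actual `takeSample` implementation: each partition is sampled independently and the candidates merged, so the partition layout changes which points seed the clustering even under a fixed seed.

```python
import random

def sample_like_take_sample(partitions, k, seed=9876):
    # Toy model of a distributed sample (NOT Spark's actual takeSample):
    # each partition is sampled independently and the candidates merged,
    # so the partition layout changes which points are picked even when
    # the seed is fixed.
    rng = random.Random(seed)
    candidates = []
    for part in partitions:
        candidates.extend(rng.sample(part, min(k, len(part))))
    rng.shuffle(candidates)
    return sorted(candidates[:k])

data = list(range(100))
one_part = [data]
four_parts = [data[i::4] for i in range(4)]

s1 = sample_like_take_sample(one_part, 5)
s4 = sample_like_take_sample(four_parts, 5)
print(s1)
print(s4)  # almost certainly differs from s1, despite the same seed
```

Each layout is individually deterministic (rerunning with the same partitioning reproduces the same sample), which matches the observation that only repartitioning changes the result.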
[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121209#comment-16121209 ] Vincent commented on SPARK-21688: - Currently, in certain places in ML/MLlib, such as mllib/SVM, BLAS operations (dot, axpy, etc.) are bound to f2j, so there is no chance to use native BLAS. We understand that level-1 BLAS APIs were pinned to f2j due to performance issues, but those were mainly caused by multi-threaded native BLAS; with proper settings we won't be bothered by that issue. So maybe we should change the f2j-bound calls in the current implementation. [~srowen] > performance improvement in mllib SVM with native BLAS > -- > > Key: SPARK-21688 > URL: https://issues.apache.org/jira/browse/SPARK-21688 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 > Environment: 4 nodes: 1 master node, 3 worker nodes > model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz > Memory : 180G > num of core per node: 10 >Reporter: Vincent > Attachments: mllib svm training.png, svm1.png, svm2.png, > svm-mkl-1.png, svm-mkl-2.png > > > in current mllib SVM implementation, we found that the CPU is not fully > utilized, one reason is that f2j blas is set to be used in the HingeGradient > computation. As we found out earlier > (https://issues.apache.org/jira/browse/SPARK-21305) that with proper > settings, native blas is generally better than f2j on the uni-test level, > here we make the blas operations in SVM go with MKL blas and get an end to > end performance report showing that in most cases native blas outperformance > f2j blas up to 50%. > So, we suggest removing those f2j-fixed calling and going for native blas if > available. If this proposal is acceptable, we will move on to benchmark other > algorithms impacted. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
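The f2j-vs-native dispatch being debated in this thread can be pictured with a small sketch. These Python stand-ins are purely illustrative (the class and function names are not Spark's API): the front-end picks a backend once, and the proposal amounts to no longer pinning level-1 ops (dot, axpy) to the f2j fallback when a native library is actually loaded.

```python
class F2jBLAS:
    """Stand-in for the pure-JVM f2j implementation (a scalar loop)."""
    def dot(self, x, y):
        return sum(a * b for a, b in zip(x, y))

class NativeBLAS(F2jBLAS):
    """Stand-in for an MKL/OpenBLAS-backed implementation."""
    pass

def pick_blas(prefer_native, native_available):
    # Spark's BLAS front-end currently pins level-1 ops to f2j; the
    # proposal is to route them to native BLAS when a native library is
    # loaded, falling back to f2j otherwise.
    if prefer_native and native_available:
        return NativeBLAS()
    return F2jBLAS()

blas = pick_blas(prefer_native=True, native_available=False)
print(blas.dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

The fallback path keeps correctness identical either way; only the backing implementation (and thus throughput) changes, which is why the discussion is purely about benchmarking.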
[jira] [Commented] (SPARK-21686) spark.sql.hive.convertMetastoreOrc is causing NullPointerException while reading ORC tables
[ https://issues.apache.org/jira/browse/SPARK-21686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121187#comment-16121187 ] Liang-Chi Hsieh commented on SPARK-21686: - I saw the affected version is 1.6.1. So are more recent versions like 2.0 or 2.2 not affected by this? > spark.sql.hive.convertMetastoreOrc is causing NullPointerException while > reading ORC tables > --- > > Key: SPARK-21686 > URL: https://issues.apache.org/jira/browse/SPARK-21686 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.6.1 > Environment: spark_2_4_2_0_258-1.6.1.2.4.2.0-258.el6.noarch > spark_2_4_2_0_258-python-1.6.1.2.4.2.0-258.el6.noarch > spark_2_4_2_0_258-yarn-shuffle-1.6.1.2.4.2.0-258.el6.noarch > RHEL-7 (64-Bit) > JDK 1.8 >Reporter: Ernani Pereira de Mattos Junior > > The issue is very similar to SPARK-10304; > Spark Query throws a NullPointerException. > >>> sqlContext.sql('select * from core_next.spark_categorization').show(57) > 17/06/19 11:26:54 ERROR Executor: Exception in task 2.0 in stage 21.0 (TID > 48) > java.lang.NullPointerException > at > org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:488) > > at > org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:244) > > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$6.apply(OrcRelation.scala:275) > > at > org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$6.apply(OrcRelation.scala:275) > > Turn off ORC optimizations and issue was resolved: > "sqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "false") -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21680) ML/MLLIB Vector compressed optimization
[ https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121175#comment-16121175 ] Peng Meng commented on SPARK-21680: --- I mean that if the user calls {{toSparse(size)}} but the size is smaller than numNonzeros, there may be a problem. > ML/MLLIB Vector compressed optimization > --- > > Key: SPARK-21680 > URL: https://issues.apache.org/jira/browse/SPARK-21680 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng > > When use Vector.compressed to change a Vector to SparseVector, the > performance is very low comparing with Vector.toSparse. > This is because you have to scan the value three times using > Vector.compressed, but you just need two times when use Vector.toSparse. > When the length of the vector is large, there is significant performance > difference between this two method. > Code of Vector compressed: > {code:java} > def compressed: Vector = { > val nnz = numNonzeros > // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs > 12 * nnz + 20 bytes. > if (1.5 * (nnz + 1.0) < size) { > toSparse > } else { > toDense > } > } > {code} > I propose to change it to: > {code:java} > // Some comments here > def compressed: Vector = { > val nnz = numNonzeros > // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs > 12 * nnz + 20 bytes. > if (1.5 * (nnz + 1.0) < size) { > val ii = new Array[Int](nnz) > val vv = new Array[Double](nnz) > var k = 0 > foreachActive { (i, v) => > if (v != 0) { > ii(k) = i > vv(k) = v > k += 1 > } > } > new SparseVector(size, ii, vv) > } else { > toDense > } > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
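The byte-count comment in the snippet above explains the 1.5 factor: a dense vector costs roughly 8 * size + 8 bytes and a sparse one 12 * nnz + 20, and 12 * nnz + 20 < 8 * size + 8 rearranges to 1.5 * (nnz + 1) < size. A rough Python sketch of the proposed two-scan `compressed` (versus the current three scans, since `toSparse` recomputes nnz) looks like this; it is an illustration of the idea, not the actual Scala patch:

```python
def dense_bytes(size):
    return 8 * size + 8      # 8 bytes per double plus array overhead

def sparse_bytes(nnz):
    return 12 * nnz + 20     # 4-byte index + 8-byte value per nonzero

def prefer_sparse(size, nnz):
    # sparse_bytes(nnz) < dense_bytes(size) rearranges to the test below
    return 1.5 * (nnz + 1.0) < size

def compressed(values):
    # Proposed two-scan variant: one scan for nnz, and if sparse wins,
    # one scan to fill the arrays -- avoiding toSparse's extra nnz scan.
    nnz = sum(1 for v in values if v != 0)
    if prefer_sparse(len(values), nnz):
        ii, vv = [], []
        for i, v in enumerate(values):
            if v != 0:
                ii.append(i)
                vv.append(v)
        return ("sparse", len(values), ii, vv)
    return ("dense", list(values))

print(compressed([0.0, 0.0, 3.0, 0.0, 0.0, 0.0]))  # ('sparse', 6, [2], [3.0])
print(compressed([1.0, 2.0, 3.0]))                 # ('dense', [1.0, 2.0, 3.0])
```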
[jira] [Commented] (SPARK-21680) ML/MLLIB Vector compressed optimization
[ https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121166#comment-16121166 ] Sean Owen commented on SPARK-21680: --- I don't get what security issue you mean here, but no the change you proposed initially is not a good solution. > ML/MLLIB Vector compressed optimization > --- > > Key: SPARK-21680 > URL: https://issues.apache.org/jira/browse/SPARK-21680 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Peng Meng > > When use Vector.compressed to change a Vector to SparseVector, the > performance is very low comparing with Vector.toSparse. > This is because you have to scan the value three times using > Vector.compressed, but you just need two times when use Vector.toSparse. > When the length of the vector is large, there is significant performance > difference between this two method. > Code of Vector compressed: > {code:java} > def compressed: Vector = { > val nnz = numNonzeros > // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs > 12 * nnz + 20 bytes. > if (1.5 * (nnz + 1.0) < size) { > toSparse > } else { > toDense > } > } > {code} > I propose to change it to: > {code:java} > // Some comments here > def compressed: Vector = { > val nnz = numNonzeros > // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs > 12 * nnz + 20 bytes. > if (1.5 * (nnz + 1.0) < size) { > val ii = new Array[Int](nnz) > val vv = new Array[Double](nnz) > var k = 0 > foreachActive { (i, v) => > if (v != 0) { > ii(k) = i > vv(k) = v > k += 1 > } > } > new SparseVector(size, ii, vv) > } else { > toDense > } > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121163#comment-16121163 ] Sean Owen commented on SPARK-21688: --- Of course native BLAS is typically faster where it is used so you should enable it. What are you specifically saying beyond that? > performance improvement in mllib SVM with native BLAS > -- > > Key: SPARK-21688 > URL: https://issues.apache.org/jira/browse/SPARK-21688 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 > Environment: 4 nodes: 1 master node, 3 worker nodes > model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz > Memory : 180G > num of core per node: 10 >Reporter: Vincent > Attachments: mllib svm training.png, svm1.png, svm2.png, > svm-mkl-1.png, svm-mkl-2.png > > > in current mllib SVM implementation, we found that the CPU is not fully > utilized, one reason is that f2j blas is set to be used in the HingeGradient > computation. As we found out earlier > (https://issues.apache.org/jira/browse/SPARK-21305) that with proper > settings, native blas is generally better than f2j on the uni-test level, > here we make the blas operations in SVM go with MKL blas and get an end to > end performance report showing that in most cases native blas outperformance > f2j blas up to 50%. > So, we suggest removing those f2j-fixed calling and going for native blas if > available. If this proposal is acceptable, we will move on to benchmark other > algorithms impacted. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21689) Spark submit will not get kerberos token token when hbase class not found
[ https://issues.apache.org/jira/browse/SPARK-21689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121124#comment-16121124 ] zhoukang commented on SPARK-21689: -- I created a PR for this issue: https://github.com/apache/spark/pull/18901. I wonder why the PR was not linked to this issue automatically? > Spark submit will not get kerberos token token when hbase class not found > - > > Key: SPARK-21689 > URL: https://issues.apache.org/jira/browse/SPARK-21689 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.1.0, 2.2.0 >Reporter: zhoukang > > When use yarn cluster mode,and we need scan hbase,there will be a case which > can not work: > If we put user jar on hdfs,when local classpath will has no hbase,which will > let get hbase token failed.Then later when job submitted to yarn, it will > failed since has no token to access hbase table.I mock three cases: > 1:user jar is on classpath, and has hbase > {code:java} > 17/08/10 13:48:03 INFO security.HadoopFSDelegationTokenProvider: Renewal > interval is 86400050 for token HDFS_DELEGATION_TOKEN > 17/08/10 13:48:03 INFO security.HadoopDelegationTokenManager: Service hive > 17/08/10 13:48:03 INFO security.HadoopDelegationTokenManager: Service hbase > 17/08/10 13:48:05 INFO security.HBaseDelegationTokenProvider: Attempting to > fetch HBase security token. > {code} > Logs showing we can get token normally. > 2:user jar on hdfs > {code:java} > 17/08/10 13:43:58 WARN security.HBaseDelegationTokenProvider: Class > org.apache.hadoop.hbase.HBaseConfiguration not found. 
> 17/08/10 13:43:58 INFO security.HBaseDelegationTokenProvider: Failed to get > token from service hbase > java.lang.ClassNotFoundException: > org.apache.hadoop.hbase.security.token.TokenUtil > at java.net.URLClassLoader$1.run(URLClassLoader.java:372) > at java.net.URLClassLoader$1.run(URLClassLoader.java:361) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:360) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at > org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:41) > at > org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$obtainDelegationTokens$2.apply(HadoopDelegationTokenManager.scala:112) > at > org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$obtainDelegationTokens$2.apply(HadoopDelegationTokenManager.scala:109) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > {code} > Logs showing we can get token failed with ClassNotFoundException. > If we download user jar from remote first,then things will work correctly. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
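The fix being proposed amounts to probing for the required class before attempting the token fetch, so a missing HBase jar degrades to a skipped provider instead of a failed fetch. A Python sketch of that guard, using module imports as a stand-in for the JVM classpath check (the module names below are hypothetical, for illustration only):

```python
import importlib

def class_available(module_name):
    # Stand-in for a JVM Class.forName check that a delegation-token
    # provider could perform before trying to fetch tokens.
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False

def obtain_tokens(providers):
    # providers: list of (service_name, required_module) pairs;
    # both names here are illustrative, not Spark's actual API.
    tokens = []
    for name, module in providers:
        if not class_available(module):
            print(f"skipping {name}: {module} not on the local classpath")
            continue
        tokens.append(name)
    return tokens

print(obtain_tokens([("hdfs", "json"), ("hbase", "hbase_client_not_installed")]))
```

Note this only avoids the ClassNotFoundException locally; as the description says, the real fix is to make the user jar's classes available before token acquisition (e.g. download the remote jar first).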
[jira] [Commented] (SPARK-21660) Yarn ShuffleService failed to start when the chosen directory become read-only
[ https://issues.apache.org/jira/browse/SPARK-21660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121125#comment-16121125 ] Saisai Shao commented on SPARK-21660: - Will yarn NM handle this bad disk problem and return a good disk for recoveryPath? I guess yarn should handle this problem. > Yarn ShuffleService failed to start when the chosen directory become read-only > -- > > Key: SPARK-21660 > URL: https://issues.apache.org/jira/browse/SPARK-21660 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Affects Versions: 2.1.1 >Reporter: lishuming > > h3. Background > In our production environment,disks corrupt to `read-only` status almost once > a month. Now the strategy of Yarn ShuffleService which chooses an available > directory(disk) to store Shuffle info(DB) is as > below(https://github.com/apache/spark/blob/master/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L340): > 1. If NameNode's recoveryPath not empty and shuffle DB exists in the > recoveryPath, return the recoveryPath; > 2. If recoveryPath empty and shuffle DB exists in > `yarn.nodemanager.local-dirs`, set recoveryPath as the existing DB path and > return the path; > 3. If recoveryPath not empty(shuffle DB not exists in the path) and shuffle > DB exists in `yarn.nodemanager.local-dirs`, mv the existing shuffle DB to > recoveryPath and return the path; > 4. If all above don't hit, we choose the first disk of > `yarn.nodemanager.local-dirs`as the recoveryPath; > All above strategy don't consider the chosen disk(directory) is writable or > not, so in our environment we meet such exception: > {code:java} > 2017-06-25 07:15:43,512 ERROR org.apache.spark.network.util.LevelDBProvider: > error opening leveldb file /mnt/dfs/12/yarn/local/registeredExecutors.ldb. 
> Creating new file, will not be able to recover state for existing applications > at > org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:48) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.(ExternalShuffleBlockResolver.java:116) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.(ExternalShuffleBlockResolver.java:94) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.(ExternalShuffleBlockHandler.java:66) > at > org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:167) > 2017-06-25 07:15:43,514 WARN org.apache.spark.network.util.LevelDBProvider: > error deleting /mnt/dfs/12/yarn/local/registeredExecutors.ldb > 2017-06-25 07:15:43,515 INFO org.apache.hadoop.service.AbstractService: > Service spark_shuffle failed in state INITED; cause: java.io.IOException: > Unable to create state store > at > org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:77) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.(ExternalShuffleBlockResolver.java:116) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.(ExternalShuffleBlockResolver.java:94) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.(ExternalShuffleBlockHandler.java:66) > at > org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:167) > at > org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:75) > {code} > h3. Consideration > 1. For many production environment, `yarn.nodemanager.local-dirs` always has > more than 1 disk, so we can make a better chosen strategy to avoid the > problem above; > 2. Can we add a strategy to check the DB directory we choose is writable, so > avoid the problem above? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
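Consideration 2 above (verify that the chosen DB directory is writable) can be sketched as follows. This is a minimal Python model of picking the first writable entry from yarn.nodemanager.local-dirs, not the actual YarnShuffleService code:

```python
import os
import tempfile

def first_writable_dir(candidates):
    # Sketch of the proposed strategy: probe each configured local dir
    # and skip read-only or missing disks, instead of blindly taking
    # the first entry of yarn.nodemanager.local-dirs.
    for d in candidates:
        if not os.path.isdir(d):
            continue
        try:
            with tempfile.TemporaryFile(dir=d):
                return d          # creating a file succeeded: writable
        except OSError:
            continue              # read-only or otherwise failing disk
    raise OSError("no writable local dir available for the recovery DB")

good = tempfile.mkdtemp()
print(first_writable_dir(["/nonexistent/disk1", good]) == good)  # True
```

The probe-by-creating-a-file approach catches the read-only-remount case directly, which a plain permission-bit check can miss on such disks.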
[jira] [Comment Edited] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121113#comment-16121113 ] Vincent edited comment on SPARK-21688 at 8/10/17 6:13 AM: -- attach svm profiling data and training comparison data for both F2J and MKL solution was (Author: vincexie): profiling > performance improvement in mllib SVM with native BLAS > -- > > Key: SPARK-21688 > URL: https://issues.apache.org/jira/browse/SPARK-21688 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 > Environment: 4 nodes: 1 master node, 3 worker nodes > model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz > Memory : 180G > num of core per node: 10 >Reporter: Vincent > Attachments: mllib svm training.png, svm1.png, svm2.png, > svm-mkl-1.png, svm-mkl-2.png > > > in current mllib SVM implementation, we found that the CPU is not fully > utilized, one reason is that f2j blas is set to be used in the HingeGradient > computation. As we found out earlier > (https://issues.apache.org/jira/browse/SPARK-21305) that with proper > settings, native blas is generally better than f2j on the uni-test level, > here we make the blas operations in SVM go with MKL blas and get an end to > end performance report showing that in most cases native blas outperformance > f2j blas up to 50%. > So, we suggest removing those f2j-fixed calling and going for native blas if > available. If this proposal is acceptable, we will move on to benchmark other > algorithms impacted. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21688) performance improvement in mllib SVM with native BLAS
[ https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vincent updated SPARK-21688: Attachment: svm1.png svm2.png svm-mkl-1.png svm-mkl-2.png profiling > performance improvement in mllib SVM with native BLAS > -- > > Key: SPARK-21688 > URL: https://issues.apache.org/jira/browse/SPARK-21688 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.2.0 > Environment: 4 nodes: 1 master node, 3 worker nodes > model name : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz > Memory : 180G > num of core per node: 10 >Reporter: Vincent > Attachments: mllib svm training.png, svm1.png, svm2.png, > svm-mkl-1.png, svm-mkl-2.png > > > in current mllib SVM implementation, we found that the CPU is not fully > utilized, one reason is that f2j blas is set to be used in the HingeGradient > computation. As we found out earlier > (https://issues.apache.org/jira/browse/SPARK-21305) that with proper > settings, native blas is generally better than f2j on the uni-test level, > here we make the blas operations in SVM go with MKL blas and get an end to > end performance report showing that in most cases native blas outperformance > f2j blas up to 50%. > So, we suggest removing those f2j-fixed calling and going for native blas if > available. If this proposal is acceptable, we will move on to benchmark other > algorithms impacted. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21689) Spark submit will not get kerberos token token when hbase class not found
[ https://issues.apache.org/jira/browse/SPARK-21689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-21689: - Description: When using yarn cluster mode and we need to scan HBase, there is a case which does not work: if we put the user jar on HDFS, the local classpath will have no HBase classes, which makes fetching the HBase token fail. Later, when the job is submitted to YARN, it will fail since it has no token to access the HBase table. I mocked three cases: 1: user jar is on the classpath and includes HBase {code:java} 17/08/10 13:48:03 INFO security.HadoopFSDelegationTokenProvider: Renewal interval is 86400050 for token HDFS_DELEGATION_TOKEN 17/08/10 13:48:03 INFO security.HadoopDelegationTokenManager: Service hive 17/08/10 13:48:03 INFO security.HadoopDelegationTokenManager: Service hbase 17/08/10 13:48:05 INFO security.HBaseDelegationTokenProvider: Attempting to fetch HBase security token. {code} Logs showing the token is fetched normally. 2: user jar on HDFS {code:java} 17/08/10 13:43:58 WARN security.HBaseDelegationTokenProvider: Class org.apache.hadoop.hbase.HBaseConfiguration not found. 
17/08/10 13:43:58 INFO security.HBaseDelegationTokenProvider: Failed to get token from service hbase java.lang.ClassNotFoundException: org.apache.hadoop.hbase.security.token.TokenUtil at java.net.URLClassLoader$1.run(URLClassLoader.java:372) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:360) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:41) at org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$obtainDelegationTokens$2.apply(HadoopDelegationTokenManager.scala:112) at org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$obtainDelegationTokens$2.apply(HadoopDelegationTokenManager.scala:109) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) {code} Logs showing the token fetch fails with a ClassNotFoundException. If we download the user jar from the remote location first, things work correctly. 
was: When use yarn cluster mode,and we need scan hbase,there will be a case which can not work: If we put user jar on hdfs,when local classpath will has no hbase,which will let get hbase token failed.Then later when job submitted to yarn, it will failed since has no token to access hbase table.I mock three cases: 1:user jar is on classpath, and has hbase {code:java} 17/08/10 13:48:03 INFO security.HadoopFSDelegationTokenProvider: Renewal interval is 86400050 for token HDFS_DELEGATION_TOKEN 17/08/10 13:48:03 INFO security.HadoopDelegationTokenManager: Service hive 17/08/10 13:48:03 INFO security.HadoopDelegationTokenManager: Service hbase 17/08/10 13:48:05 INFO security.HBaseDelegationTokenProvider: Attempting to fetch HBase security token. {code} 2:user jar on hdfs {code:java} 17/08/10 13:43:58 WARN security.HBaseDelegationTokenProvider: Class org.apache.hadoop.hbase.HBaseConfiguration not found. 17/08/10 13:43:58 INFO security.HBaseDelegationTokenProvider: Failed to get token from service hbase java.lang.ClassNotFoundException: org.apache.hadoop.hbase.security.token.TokenUtil at java.net.URLClassLoader$1.run(URLClassLoader.java:372) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:360) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:41) at org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$obtainDelegationTokens$2.apply(HadoopDelegationTokenManager.scala:112) at org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$obtainDelegationTokens$2.apply(HadoopDelegationTokenManager.scala:109) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) {code} If we download user jar from remote first,then things will work correctly. > Spark submit will not get kerberos token token when hbase class not found > - > > Key: SPARK-21689 > URL:
[jira] [Created] (SPARK-21690) one-pass imputer
zhengruifeng created SPARK-21690: Summary: one-pass imputer Key: SPARK-21690 URL: https://issues.apache.org/jira/browse/SPARK-21690 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.1 Reporter: zhengruifeng {code} val surrogates = $(inputCols).map { inputCol => val ic = col(inputCol) val filtered = dataset.select(ic.cast(DoubleType)) .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN) if (filtered.take(1).length == 0) { throw new SparkException(s"surrogate cannot be computed. " + s"All the values in $inputCol are Null, Nan or missingValue(${$(missingValue)})") } val surrogate = $(strategy) match { case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first() case Imputer.median => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001).head } surrogate } {code} The current impl of {{Imputer}} processes one column after another. Instead, we should parallelize the processing in a more efficient way. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
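A single-scan version of the idea can be sketched in Python for the mean strategy only (the median case would additionally need per-column quantile sketches such as approxQuantile uses). This is an illustration of the one-pass aggregation, not the actual Imputer patch:

```python
import math

def one_pass_means(rows, missing_value):
    # Single scan over the data computing the mean surrogate for every
    # column at once, instead of one filtered aggregation per column.
    # Columns whose values are all null/NaN/missing are simply omitted
    # here (the real Imputer raises a SparkException in that case).
    sums, counts = {}, {}
    for row in rows:
        for col, v in row.items():
            if v is None or v == missing_value or (isinstance(v, float) and math.isnan(v)):
                continue
            sums[col] = sums.get(col, 0.0) + v
            counts[col] = counts.get(col, 0) + 1
    return {col: sums[col] / counts[col] for col in sums}

rows = [{"a": 1.0, "b": -1.0},
        {"a": 3.0, "b": float("nan")},
        {"a": -1.0, "b": 4.0}]
print(one_pass_means(rows, missing_value=-1.0))  # {'a': 2.0, 'b': 4.0}
```

In DataFrame terms this corresponds to one aggregation over all input columns at once rather than one filtered job per column, which is where the speedup would come from.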