[jira] [Updated] (SPARK-21703) Why RPC messages are transferred with header and body separately in TCP frames

2017-08-10 Thread neoremind (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

neoremind updated SPARK-21703:
--
Description: 
After looking into the details of how Spark leverages Netty, I have a question. 
Typically an RPC message wire format has a header+payload structure, and Netty 
uses a TransportFrameDecoder to determine a complete message from the remote 
peer. But after sniffing the traffic with Wireshark, I found that the header 
and the body of a message are sent separately. Although this works fine, the 
underlying TCP connection sends back ACK segments to acknowledge each write, so 
there is a little redundancy, since the header is usually very small and could 
be sent together with the body.

The main reason can be found in the MessageWithHeader class, whose transferTo 
method performs two separate writes for the header and the body.

Could someone help me understand the background of why it is implemented this 
way?  Thanks!
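
For reference, here is a minimal sketch (plain NIO, not Spark's actual 
MessageWithHeader code; channel, header and body are placeholders) contrasting 
two separate writes with a single gathering write that lets the kernel coalesce 
the small header and the body into one segment:

{code}
import java.nio.ByteBuffer
import java.nio.channels.GatheringByteChannel

// Two writes: the small header can go out in its own TCP segment
// and be ACKed separately from the body.
def writeSeparately(channel: GatheringByteChannel, header: ByteBuffer, body: ByteBuffer): Unit = {
  channel.write(header)
  channel.write(body)
}

// One gathering write: header and body are handed to the kernel together,
// so they can be coalesced into a single segment when they fit.
def writeGathered(channel: GatheringByteChannel, header: ByteBuffer, body: ByteBuffer): Unit = {
  channel.write(Array(header, body))
}
{code}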

  was:
After looking into the details of how Spark leverages Netty, I have a question. 
Typically an RPC message wire format has a header+payload structure, and Netty 
uses a TransportFrameDecoder to determine a complete message from the remote 
peer. But after sniffing the traffic with Wireshark, I found that the header 
and the body of a message are sent separately. Although this works fine, the 
underlying TCP connection sends back ACK segments to acknowledge each write, so 
there is a little redundancy, since the header is usually very small and could 
be sent together with the body.

The main reason can be found in the MessageWithHeader class, whose transferTo 
method performs two separate writes for the header and the body.

Could someone help me understand the background of why it is implemented this 
way?  Thanks!


> Why RPC messages are transferred with header and body separately in TCP frames
> 
>
> Key: SPARK-21703
> URL: https://issues.apache.org/jira/browse/SPARK-21703
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: neoremind
>Priority: Trivial
>  Labels: RPC
>
> After looking into the details of how Spark leverages Netty, I have a question. 
> Typically an RPC message wire format has a header+payload structure, and Netty 
> uses a TransportFrameDecoder to determine a complete message from the remote 
> peer. But after sniffing the traffic with Wireshark, I found that the header 
> and the body of a message are sent separately. Although this works fine, the 
> underlying TCP connection sends back ACK segments to acknowledge each write, 
> so there is a little redundancy, since the header is usually very small and 
> could be sent together with the body.
> The main reason can be found in the MessageWithHeader class, whose transferTo 
> method performs two separate writes for the header and the body.
> Could someone help me understand the background of why it is implemented this 
> way?  Thanks!








[jira] [Created] (SPARK-21703) Why RPC messages are transferred with header and body separately in TCP frames

2017-08-10 Thread neoremind (JIRA)
neoremind created SPARK-21703:
-

 Summary: Why RPC messages are transferred with header and body 
separately in TCP frames
 Key: SPARK-21703
 URL: https://issues.apache.org/jira/browse/SPARK-21703
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: neoremind
Priority: Trivial


After looking into the details of how Spark leverages Netty, I have a question. 
Typically an RPC message wire format has a header+payload structure, and Netty 
uses a TransportFrameDecoder to determine a complete message from the remote 
peer. But after sniffing the traffic with Wireshark, I found that the header 
and the body of a message are sent separately. Although this works fine, the 
underlying TCP connection sends back ACK segments to acknowledge each write, so 
there is a little redundancy, since the header is usually very small and could 
be sent together with the body.

The main reason can be found in the MessageWithHeader class, whose transferTo 
method performs two separate writes for the header and the body.

Could someone help me understand the background of why it is implemented this 
way?  Thanks!






[jira] [Created] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used

2017-08-10 Thread George Pongracz (JIRA)
George Pongracz created SPARK-21702:
---

 Summary: Structured Streaming S3A SSE Encryption Not Applied when 
PartitionBy Used
 Key: SPARK-21702
 URL: https://issues.apache.org/jira/browse/SPARK-21702
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
 Environment: Hadoop 2.7.3: AWS SDK 1.7.4
Hadoop 2.8.1: AWS SDK 1.10.6

Reporter: George Pongracz
Priority: Minor


Settings:
  .config("spark.hadoop.fs.s3a.impl", 
"org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "AES256")

When writing to an S3 sink from Structured Streaming, the files are being 
encrypted using AES-256.

When a "partitionBy" is introduced, the output data files are unencrypted.

All other supporting files and metadata are encrypted.

I suspect the write to the temporary location is encrypted and the move/rename 
is not applying the SSE.
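
For reference, a minimal sketch of the kind of query that seems to trigger this 
(the bucket/paths and the "rate" source are placeholders, not the actual job):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_date

val spark = SparkSession.builder()
  .appName("s3a-sse-partitionby-repro")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "AES256")
  .getOrCreate()

// Any streaming DataFrame with a partition column will do; "rate" is just a stand-in source.
val df = spark.readStream.format("rate").load().withColumn("date", current_date())

// Without partitionBy the output files come back SSE-AES256 encrypted;
// with partitionBy the data files reportedly do not.
val query = df.writeStream
  .format("parquet")
  .option("checkpointLocation", "s3a://some-bucket/checkpoints/repro") // placeholder path
  .partitionBy("date")
  .start("s3a://some-bucket/output/repro") // placeholder path

query.awaitTermination()
{code}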






[jira] [Created] (SPARK-21701) Add TCP send/rcv buffer size support for RPC client

2017-08-10 Thread neoremind (JIRA)
neoremind created SPARK-21701:
-

 Summary: Add TCP send/rcv buffer size support for RPC client
 Key: SPARK-21701
 URL: https://issues.apache.org/jira/browse/SPARK-21701
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: neoremind
Priority: Trivial


In the TransportClientFactory class, there are no parameters derived from 
SparkConf to set ChannelOption.SO_RCVBUF and ChannelOption.SO_SNDBUF for Netty. 
Increasing the receive buffer size can improve I/O performance for high-volume 
transport.
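
For reference, a minimal sketch of what wiring these options into the client 
bootstrap could look like; configureBuffers is illustrative only (not the actual 
TransportClientFactory code), and the buffer sizes would come from whatever 
SparkConf keys get chosen:

{code}
import io.netty.bootstrap.Bootstrap
import io.netty.channel.ChannelOption

// Apply TCP send/receive buffer sizes to a Netty client bootstrap when they
// are explicitly configured (<= 0 means "leave the OS default").
def configureBuffers(bootstrap: Bootstrap, sendBuf: Int, recvBuf: Int): Bootstrap = {
  if (sendBuf > 0) {
    bootstrap.option[Integer](ChannelOption.SO_SNDBUF, Integer.valueOf(sendBuf))
  }
  if (recvBuf > 0) {
    bootstrap.option[Integer](ChannelOption.SO_RCVBUF, Integer.valueOf(recvBuf))
  }
  bootstrap
}
{code}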







[jira] [Commented] (SPARK-21700) How can I get the MetricsSystem information

2017-08-10 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122762#comment-16122762
 ] 

Alex Bozarth commented on SPARK-21700:
--

I would recommend taking a look at the Metrics REST API 
(https://spark.apache.org/docs/latest/monitoring.html#rest-api). Also, in the 
future, the mailing list would probably be a better place for questions like these.
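
For example (assuming the driver UI is reachable on localhost:4040; adjust host 
and port for your deployment), the application list can be pulled 
programmatically like this:

{code}
import scala.io.Source

// Fetch the running applications from the monitoring REST API exposed by the driver UI;
// the same API also serves executors, stages and jobs under /api/v1/applications/[app-id]/...
val apps = Source.fromURL("http://localhost:4040/api/v1/applications").mkString
println(apps)
{code}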

> How can I get the MetricsSystem information
> ---
>
> Key: SPARK-21700
> URL: https://issues.apache.org/jira/browse/SPARK-21700
> Project: Spark
>  Issue Type: Question
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: LiuXiangyu
>
> I want to get the information that is shown on the Spark Web UI, but I don't 
> want to write a spider to scrape that information from the website. I know 
> that information comes from the MetricsSystem. Is there any way I can use the 
> MetricsSystem in my program to get those metrics?






[jira] [Updated] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests

2017-08-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21693:
-
Description: 
We now sometimes reach the 1.5-hour time limit, e.g. 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
I requested to increase this from an hour to 1.5 hours before, but it looks like 
we should fix this in Spark. I have asked this for my account a few times 
before, but it looks like we can't keep increasing this time limit.

I could identify two things that look like they take quite a bit of time:


1. Disabled cache feature in pull request builder, which ends up downloading 
Maven dependencies (roughly 10ish mins)


https://www.appveyor.com/docs/build-cache/

{quote}
Note: Saving cache is disabled in Pull Request builds.
{quote}

and also see 
http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working

This seems difficult to fix within Spark.


2. "MLlib classification algorithms" tests (30-35ish mins)

The test below looks like it takes 30-35ish mins.

{code}
MLlib classification algorithms, except for tree-based algorithms: Spark 
package found in SPARK_HOME: C:\projects\spark\bin\..
..
{code}

As (I think) a last resort, we could make a build matrix entry for this test 
alone, so that we run the other tests in one build and then run this test in 
another build. For example, I run the Scala tests via this workaround - 
https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix 
with 7 builds, each with its own build and test).

I am also checking and testing other ways.


  was:
We now sometimes reach the 1.5-hour time limit, e.g. 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
I requested to increase this from an hour to 1.5 hours before, but it looks like 
we should fix this in Spark. I have asked this for my account a few times 
before, but it looks like we can't keep increasing this time limit.

I could identify two things that look like they take quite a bit of time:


1. Disabled cache feature in pull request builder, which ends up downloading 
Maven dependencies (10-20ish mins)


https://www.appveyor.com/docs/build-cache/

{quote}
Note: Saving cache is disabled in Pull Request builds.
{quote}

and also see 
http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working

This seems difficult to fix within Spark.


2. "MLlib classification algorithms" tests (30-35ish mins)

The test below looks like it takes 30-35ish mins.

{code}
MLlib classification algorithms, except for tree-based algorithms: Spark 
package found in SPARK_HOME: C:\projects\spark\bin\..
..
{code}

As (I think) a last resort, we could make a build matrix entry for this test 
alone, so that we run the other tests in one build and then run this test in 
another build. For example, I run the Scala tests via this workaround - 
https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix 
with 7 builds, each with its own build and test).

I am also checking and testing other ways.



> AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
> -
>
> Key: SPARK-21693
> URL: https://issues.apache.org/jira/browse/SPARK-21693
> Project: Spark
>  Issue Type: Test
>  Components: Build, SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> We now sometimes reach the 1.5-hour time limit, e.g. 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
> I requested to increase this from an hour to 1.5 hours before, but it looks 
> like we should fix this in Spark. I have asked this for my account a few times 
> before, but it looks like we can't keep increasing this time limit.
> I could identify two things that look like they take quite a bit of time:
> 1. Disabled cache feature in pull request builder, which ends up downloading 
> Maven dependencies (roughly 10ish mins)
> https://www.appveyor.com/docs/build-cache/
> {quote}
> Note: Saving cache is disabled in Pull Request builds.
> {quote}
> and also see 
> http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working
> This seems difficult to fix within Spark.
> 2. "MLlib classification algorithms" tests (30-35ish mins)
> The test below looks like it takes 30-35ish mins.
> {code}
> MLlib classification algorithms, except for tree-based algorithms: Spark 
> package found in SPARK_HOME: C:\projects\spark\bin\..
> ..
> {code}
> As (I think) a last resort, we could make a build matrix entry for this test 
> alone, so that we run the other tests in one build and then run this test in 
> another build. For example, I run the Scala tests via this workaround - 
> 

[jira] [Created] (SPARK-21700) How can I get the MetricsSystem information

2017-08-10 Thread LiuXiangyu (JIRA)
LiuXiangyu created SPARK-21700:
--

 Summary: How can I get the MetricsSystem information
 Key: SPARK-21700
 URL: https://issues.apache.org/jira/browse/SPARK-21700
 Project: Spark
  Issue Type: Question
  Components: Web UI
Affects Versions: 1.6.0
Reporter: LiuXiangyu


I want to get the information that is shown on the Spark Web UI, but I don't want 
to write a spider to scrape that information from the website. I know that 
information comes from the MetricsSystem. Is there any way I can use the 
MetricsSystem in my program to get those metrics?






[jira] [Updated] (SPARK-21699) Remove unused getTableOption in ExternalCatalog

2017-08-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-21699:

Fix Version/s: 2.2.1

> Remove unused getTableOption in ExternalCatalog
> ---
>
> Key: SPARK-21699
> URL: https://issues.apache.org/jira/browse/SPARK-21699
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.2.1, 2.3.0
>
>







[jira] [Resolved] (SPARK-21699) Remove unused getTableOption in ExternalCatalog

2017-08-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-21699.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Remove unused getTableOption in ExternalCatalog
> ---
>
> Key: SPARK-21699
> URL: https://issues.apache.org/jira/browse/SPARK-21699
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.3.0
>
>







[jira] [Commented] (SPARK-21564) TaskDescription decoding failure should fail the task

2017-08-10 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122651#comment-16122651
 ] 

Andrew Ash commented on SPARK-21564:


[~irashid] a possible fix could look roughly like this:

{noformat}
diff --git 
a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
 
b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
index a2f1aa22b0..06d72fe106 100644
--- 
a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
+++ 
b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
@@ -17,6 +17,7 @@

 package org.apache.spark.executor

+import java.io.{DataInputStream, NotSerializableException}
 import java.net.URL
 import java.nio.ByteBuffer
 import java.util.Locale
@@ -35,7 +36,7 @@ import org.apache.spark.rpc._
 import org.apache.spark.scheduler.{ExecutorLossReason, TaskDescription}
 import org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages._
 import org.apache.spark.serializer.SerializerInstance
-import org.apache.spark.util.{ThreadUtils, Utils}
+import org.apache.spark.util.{ByteBufferInputStream, ThreadUtils, Utils}

 private[spark] class CoarseGrainedExecutorBackend(
 override val rpcEnv: RpcEnv,
@@ -93,9 +94,26 @@ private[spark] class CoarseGrainedExecutorBackend(
   if (executor == null) {
 exitExecutor(1, "Received LaunchTask command but executor was null")
   } else {
-val taskDesc = TaskDescription.decode(data.value)
-logInfo("Got assigned task " + taskDesc.taskId)
-executor.launchTask(this, taskDesc)
+try {
+  val taskDesc = TaskDescription.decode(data.value)
+  logInfo("Got assigned task " + taskDesc.taskId)
+  executor.launchTask(this, taskDesc)
+} catch {
+  case e: Exception =>
+val taskId = new DataInputStream(new ByteBufferInputStream(
+  ByteBuffer.wrap(data.value.array()))).readLong()
+val ser = env.closureSerializer.newInstance()
+val serializedTaskEndReason = {
+  try {
+ser.serialize(new ExceptionFailure(e, Nil))
+  } catch {
+case _: NotSerializableException =>
+  // e is not serializable so just send the stacktrace
+  ser.serialize(new ExceptionFailure(e, Nil, false))
+  }
+}
+statusUpdate(taskId, TaskState.FAILED, serializedTaskEndReason)
+}
   }

 case KillTask(taskId, _, interruptThread, reason) =>
{noformat}

The downside here though is that we're still making the assumption that the 
TaskDescription is well-formatted enough to be able to get the taskId out of it 
(the first long in the serialized bytes).

Any other thoughts on how to do this?

> TaskDescription decoding failure should fail the task
> -
>
> Key: SPARK-21564
> URL: https://issues.apache.org/jira/browse/SPARK-21564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>
> cc [~robert3005]
> I was seeing an issue where Spark was throwing this exception:
> {noformat}
> 16:16:28.294 [dispatcher-event-loop-14] ERROR 
> org.apache.spark.rpc.netty.Inbox - Ignoring error
> java.io.EOFException: null
> at java.io.DataInputStream.readFully(DataInputStream.java:197)
> at java.io.DataInputStream.readUTF(DataInputStream.java:609)
> at java.io.DataInputStream.readUTF(DataInputStream.java:564)
> at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
> at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
> at scala.collection.immutable.Range.foreach(Range.scala:160)
> at 
> org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
> at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
> at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> For details on the cause of that exception, see SPARK-21563
> We've since changed the application and have a proposed fix in Spark at the 
> ticket above, but it was troubling that decoding the 

[jira] [Commented] (SPARK-21563) Race condition when serializing TaskDescriptions and adding jars

2017-08-10 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122549#comment-16122549
 ] 

Andrew Ash commented on SPARK-21563:


Thanks for the thoughts [~irashid] -- I submitted a PR implementing this 
approach at https://github.com/apache/spark/pull/18913

> Race condition when serializing TaskDescriptions and adding jars
> 
>
> Key: SPARK-21563
> URL: https://issues.apache.org/jira/browse/SPARK-21563
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>
> cc [~robert3005]
> I was seeing this exception during some running Spark jobs:
> {noformat}
> 16:16:28.294 [dispatcher-event-loop-14] ERROR 
> org.apache.spark.rpc.netty.Inbox - Ignoring error
> java.io.EOFException: null
> at java.io.DataInputStream.readFully(DataInputStream.java:197)
> at java.io.DataInputStream.readUTF(DataInputStream.java:609)
> at java.io.DataInputStream.readUTF(DataInputStream.java:564)
> at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
> at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
> at scala.collection.immutable.Range.foreach(Range.scala:160)
> at 
> org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
> at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
> at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> After some debugging, we determined that this is due to a race condition in 
> task serde.  cc [~irashid] [~kayousterhout] who last touched that code in 
> SPARK-19796
> The race is between adding additional jars to the SparkContext and 
> serializing the TaskDescription.
> Consider this sequence of events:
> - TaskSetManager creates a TaskDescription using a reference to the 
> SparkContext's jars: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L506
> - TaskDescription starts serializing, and begins writing jars: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L84
> - the size of the jar map is written out: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L63
> - _on another thread_: the application adds a jar to the SparkContext's jars 
> list
> - then the entries in the jars list are serialized out: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L64
> The problem now is that the jars list is serialized as having N entries, but 
> actually N+1 entries follow that count!
> This causes task deserialization to fail in the executor, with the stacktrace 
> above.
> The same issue also likely exists for files, though I haven't observed that 
> and our application does not stress that codepath the same way it did for jar 
> additions.
> One fix here is that TaskSetManager could make an immutable copy of the jars 
> list that it passes into the TaskDescription constructor, so that list 
> doesn't change mid-serialization.
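
A small illustration of that fix idea (illustrative names only, not the actual 
TaskSetManager/TaskDescription code): the count and the entries must be taken 
from the same frozen view of the map, which an immutable copy guarantees:

{code}
import java.io.{ByteArrayOutputStream, DataOutputStream}
import scala.collection.mutable

// Writing the size and then iterating the same *live* mutable map is the race:
// an entry added by another thread in between makes the count and the entries disagree.
def writeMap(out: DataOutputStream, map: scala.collection.Map[String, Long]): Unit = {
  out.writeInt(map.size)
  for ((name, timestamp) <- map) {
    out.writeUTF(name)
    out.writeLong(timestamp)
  }
}

val addedJars = mutable.HashMap("a.jar" -> 1L)      // stands in for SparkContext's jar map
val snapshot: Map[String, Long] = addedJars.toMap   // immutable copy taken up front
writeMap(new DataOutputStream(new ByteArrayOutputStream()), snapshot)
{code}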






[jira] [Created] (SPARK-21699) Remove unused getTableOption in ExternalCatalog

2017-08-10 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-21699:
---

 Summary: Remove unused getTableOption in ExternalCatalog
 Key: SPARK-21699
 URL: https://issues.apache.org/jira/browse/SPARK-21699
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Reynold Xin
Assignee: Reynold Xin









[jira] [Updated] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests

2017-08-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21693:
-
Description: 
We now sometimes reach the 1.5-hour time limit, e.g. 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
I requested to increase this from an hour to 1.5 hours before, but it looks like 
we should fix this in Spark. I have asked this for my account a few times 
before, but it looks like we can't keep increasing this time limit.

I could identify two things that look like they take quite a bit of time:


1. Disabled cache feature in pull request builder, which ends up downloading 
Maven dependencies (10-20ish mins)


https://www.appveyor.com/docs/build-cache/

{quote}
Note: Saving cache is disabled in Pull Request builds.
{quote}

and also see 
http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working

This seems difficult to fix within Spark.


2. "MLlib classification algorithms" tests (30-35ish mins)

The test below looks like it takes 30-35ish mins.

{code}
MLlib classification algorithms, except for tree-based algorithms: Spark 
package found in SPARK_HOME: C:\projects\spark\bin\..
..
{code}

As (I think) a last resort, we could make a build matrix entry for this test 
alone, so that we run the other tests in one build and then run this test in 
another build. For example, I run the Scala tests via this workaround - 
https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix 
with 7 builds, each with its own build and test).

I am also checking and testing other ways.


  was:
We now sometimes reach the 1.5-hour time limit, e.g. 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
I requested to increase this from an hour to 1.5 hours before, but it looks like 
we should fix this in AppVeyor. I have asked this for my account a few times 
before, but it looks like we can't keep increasing this time limit.

I could identify two things that look like they take quite a bit of time:


1. Disabled cache feature in pull request builder, which ends up downloading 
Maven dependencies (10-20ish mins)


https://www.appveyor.com/docs/build-cache/

{quote}
Note: Saving cache is disabled in Pull Request builds.
{quote}

and also see 
http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working

This seems difficult to fix within Spark.


2. "MLlib classification algorithms" tests (30-35ish mins)

The test below looks like it takes 30-35ish mins.

{code}
MLlib classification algorithms, except for tree-based algorithms: Spark 
package found in SPARK_HOME: C:\projects\spark\bin\..
..
{code}

As (I think) a last resort, we could make a build matrix entry for this test 
alone, so that we run the other tests in one build and then run this test in 
another build. For example, I run the Scala tests via this workaround - 
https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix 
with 7 builds, each with its own build and test).

I am also checking and testing other ways.



> AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
> -
>
> Key: SPARK-21693
> URL: https://issues.apache.org/jira/browse/SPARK-21693
> Project: Spark
>  Issue Type: Test
>  Components: Build, SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> We now sometimes reach the 1.5-hour time limit, e.g. 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
> I requested to increase this from an hour to 1.5 hours before, but it looks 
> like we should fix this in Spark. I have asked this for my account a few times 
> before, but it looks like we can't keep increasing this time limit.
> I could identify two things that look like they take quite a bit of time:
> 1. Disabled cache feature in pull request builder, which ends up downloading 
> Maven dependencies (10-20ish mins)
> https://www.appveyor.com/docs/build-cache/
> {quote}
> Note: Saving cache is disabled in Pull Request builds.
> {quote}
> and also see 
> http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working
> This seems difficult to fix within Spark.
> 2. "MLlib classification algorithms" tests (30-35ish mins)
> The test below looks like it takes 30-35ish mins.
> {code}
> MLlib classification algorithms, except for tree-based algorithms: Spark 
> package found in SPARK_HOME: C:\projects\spark\bin\..
> ..
> {code}
> As (I think) a last resort, we could make a build matrix entry for this test 
> alone, so that we run the other tests in one build and then run this test in 
> another build. For example, I run the Scala tests via this workaround - 
> 

[jira] [Commented] (SPARK-21685) Params isSet in scala Transformer triggered by _setDefault in pyspark

2017-08-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122492#comment-16122492
 ] 

Joseph K. Bradley commented on SPARK-21685:
---

Could you please point to more info, such as the Python wrappers you are 
calling?  I don't see enough info here to identify the problem.

> Params isSet in scala Transformer triggered by _setDefault in pyspark
> -
>
> Key: SPARK-21685
> URL: https://issues.apache.org/jira/browse/SPARK-21685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Ratan Rai Sur
>
> I'm trying to write a PySpark wrapper for a Transformer whose transform 
> method includes the line
> {code:java}
> require(!(isSet(outputNodeName) && isSet(outputNodeIndex)), "Can't set both 
> outputNodeName and outputNodeIndex")
> {code}
> This should only throw an exception when both of these parameters are 
> explicitly set.
> In the PySpark wrapper for the Transformer, there is this line in __init__
> {code:java}
> self._setDefault(outputNodeIndex=0)
> {code}
> Here is the line in the main python script showing how it is being configured
> {code:java}
> cntkModel = 
> CNTKModel().setInputCol("images").setOutputCol("output").setModelLocation(spark,
>  model.uri).setOutputNodeName("z")
> {code}
> As you can see, only setOutputNodeName is being explicitly set but the 
> exception is still being thrown.
> If you need more context, 
> https://github.com/RatanRSur/mmlspark/tree/default-cntkmodel-output is the 
> branch with the code, the files I'm referring to here that are tracked are 
> the following:
> src/cntk-model/src/main/scala/CNTKModel.scala
> notebooks/tests/301 - CIFAR10 CNTK CNN Evaluation.ipynb
> The pyspark wrapper code is autogenerated






[jira] [Updated] (SPARK-21698) write.partitionBy() is giving me garbage data

2017-08-10 Thread Luis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis updated SPARK-21698:
-
Description: 
Spark partitionBy is causing some data corruption. I am doing three super simple 
writes. Below is the code to reproduce the problem.




{code:title=Program Output|borderStyle=solid}
17/08/10 16:05:03 WARN [SparkUtils]: [Database exists] test
/usr/local/spark/python/pyspark/sql/session.py:331: UserWarning: inferring 
schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"
+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
+---++-+

17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  5|   5|5|
|  6|   6|6|
+---++-+

+---++-+
| id|name|count|
+---++-+
|  9|   4| null|
| 10|   6| null|
|  7|   1| null|
|  8|   2| null|
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  5|   5|5|
|  6|   6|6|
+---++-+


{code}

In the last show(), I see the data is null.

{code:title=spark init|borderStyle=solid}
self.spark = SparkSession \
.builder \
.master("spark://localhost:7077") \
.enableHiveSupport() \
.getOrCreate()


{code}

{code:title=Code for the test case|borderStyle=solid}
def test_clean_insert_table(self):
table_name = "data"
data0 = [
{"id": 1, "name":"1", "count": 1},
{"id": 2, "name":"2", "count": 2},
{"id": 3, "name":"3", "count": 3},
]
df_data0 = self.spark.createDataFrame(data0)

df_data0.write.partitionBy("count").mode("overwrite").saveAsTable(table_name)
df_return = self.spark.read.table(table_name)
df_return.show()

data1 = [
{"id": 4, "name":"4", "count": 4},
{"id": 5, "name":"5", "count": 5},
{"id": 6, "name":"6", "count": 6},
]
df_data1 = self.spark.createDataFrame(data1)
df_data1.write.insertInto(table_name)
df_return = self.spark.read.table(table_name)
df_return.show()

data3 = [
{"id": 1, "name":"one", "count":7},
{"id": 2, "name":"two", "count": 8},
{"id": 4, "name":"three", "count": 9},
{"id": 6, "name":"six", "count":10}
]
data3 = self.spark.createDataFrame(data3)
data3.write.insertInto(table_name)
df_return = self.spark.read.table(table_name)
df_return.show()
{code}
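
One thing that may be relevant to the repro above (not a confirmed diagnosis): 
DataFrameWriter.insertInto resolves columns by position rather than by name, a 
partitioned saveAsTable stores the partition column last, and createDataFrame 
over Python dicts orders columns alphabetically. A minimal Scala sketch of that 
position-based behaviour (illustrative only, not the reporter's code):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").enableHiveSupport().getOrCreate()
import spark.implicits._

val first = Seq((1L, "1", 1L)).toDF("id", "name", "count")
first.write.partitionBy("count").mode("overwrite").saveAsTable("data")
// The table schema is now (id, name, count), with the partition column last.

// A DataFrame ordered (count, id, name) is inserted positionally:
// count -> id, id -> name, name -> count ("three" cannot be cast to a number, so count becomes null).
val wrongOrder = Seq((9L, 4L, "three")).toDF("count", "id", "name")
wrongOrder.write.insertInto("data")

// Selecting the columns in the table's order before insertInto avoids the mismatch.
wrongOrder.select("id", "name", "count").write.insertInto("data")
{code}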







  was:
Spark partitionBy is causing some data corruption. I am doing three super simple 
writes. Below is the code to reproduce the problem.




{code:title=Program Output|borderStyle=solid}
17/08/10 16:05:03 WARN [SparkUtils]: [Database exists] test
/usr/local/spark/python/pyspark/sql/session.py:331: UserWarning: inferring 
schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"
+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
+---++-+

17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  5|   5|5|
|  6|   6|6|
+---++-+

+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  4|   4|4|
|  5|   5|5|
|  5|   5|5|
|  6|   6|6|
|  6|   6|6|
+---++-+

{code}

In the last show(), I see the data isn't what I would expect.

{code:title=spark init|borderStyle=solid}
self.spark = SparkSession \
.builder \
.master("spark://localhost:7077") \
.enableHiveSupport() \
.getOrCreate()


{code}

{code:title=Code for the test case|borderStyle=solid}
def test_clean_insert_table(self):
table_name = "data"
data0 = [
{"id": 1, 

[jira] [Updated] (SPARK-21698) write.partitionBy() is giving me garbage data

2017-08-10 Thread Luis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis updated SPARK-21698:
-
Description: 
Spark partitionBy is causing some data corruption. I am doing three super simple 
writes. Below is the code to reproduce the problem.




{code:title=Program Output|borderStyle=solid}
17/08/10 16:05:03 WARN [SparkUtils]: [Database exists] test
/usr/local/spark/python/pyspark/sql/session.py:331: UserWarning: inferring 
schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"
+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
+---++-+

17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  5|   5|5|
|  6|   6|6|
+---++-+

+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  4|   4|4|
|  5|   5|5|
|  5|   5|5|
|  6|   6|6|
|  6|   6|6|
+---++-+

{code}

In the last show(), I see the data isn't what I would expect.

{code:title=spark init|borderStyle=solid}
self.spark = SparkSession \
.builder \
.master("spark://localhost:7077") \
.enableHiveSupport() \
.getOrCreate()


{code}

{code:title=Code for the test case|borderStyle=solid}
def test_clean_insert_table(self):
table_name = "data"
data0 = [
{"id": 1, "name":"1", "count": 1},
{"id": 2, "name":"2", "count": 2},
{"id": 3, "name":"3", "count": 3},
]
df_data0 = self.spark.createDataFrame(data0)

df_data0.write.partitionBy("count").mode("overwrite").saveAsTable(table_name)
df_return = self.spark.read.table(table_name)
df_return.show()

data1 = [
{"id": 4, "name":"4", "count": 4},
{"id": 5, "name":"5", "count": 5},
{"id": 6, "name":"6", "count": 6},
]
df_data1 = self.spark.createDataFrame(data1)
df_data1.write.insertInto(table_name)
df_return = self.spark.read.table(table_name)
df_return.show()

data3 = [
{"id": 1, "name":"one", "count":7},
{"id": 2, "name":"two", "count": 8},
{"id": 4, "name":"three", "count": 9},
{"id": 6, "name":"six", "count":10}
]
data3 = self.spark.createDataFrame(data3)
data3.write.insertInto(table_name)
df_return = self.spark.read.table(table_name)
df_return.show()
{code}







  was:
Spark partitionBy is causing some data corruption. I am doing three super simple 
writes. Below is the code to reproduce the problem.




{code:title=Program Output|borderStyle=solid}
17/08/10 16:05:03 WARN [SparkUtils]: [Database exists] test
/usr/local/spark/python/pyspark/sql/session.py:331: UserWarning: inferring 
schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"
+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
+---++-+

17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  5|   5|5|
|  6|   6|6|
+---++-+

+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  4|   4|4|
|  5|   5|5|
|  5|   5|5|
|  6|   6|6|
|  6|   6|6|
+---++-+

{code}

In the last show(), I see the data isn't what I would expect.

{code:title=spark init|borderStyle=solid}
self.spark = SparkSession \
.builder \
.master("spark://localhost:7077") \
.enableHiveSupport() \
.getOrCreate()


{code}

{code:title=Code for the test case|borderStyle=solid}
def test_clean_insert_table(self):
table_name = "data"
data0 = [
{"id": 1, 

[jira] [Updated] (SPARK-21698) write.partitionBy() is giving me garbage data

2017-08-10 Thread Luis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis updated SPARK-21698:
-
Description: 
Spark partitionBy is causing some data corruption. I am doing three super simple 
writes. Below is the code to reproduce the problem.




{code:title=Program Output|borderStyle=solid}
17/08/10 16:05:03 WARN [SparkUtils]: [Database exists] test
/usr/local/spark/python/pyspark/sql/session.py:331: UserWarning: inferring 
schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"
+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
+---++-+

17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  5|   5|5|
|  6|   6|6|
+---++-+

+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  4|   4|4|
|  5|   5|5|
|  5|   5|5|
|  6|   6|6|
|  6|   6|6|
+---++-+

{code}

In the last show(), I see the data isn't what I would expect.

{code:title=spark init|borderStyle=solid}
self.spark = SparkSession \
.builder \
.master("spark://localhost:7077") \
.enableHiveSupport() \
.getOrCreate()


{code}

{code:title=Code for the test case|borderStyle=solid}
def test_clean_insert_table(self):
table_name = "data"
data0 = [
{"id": 1, "name":"1", "count": 1},
{"id": 2, "name":"2", "count": 2},
{"id": 3, "name":"3", "count": 3},
]
df_data0 = self.spark.createDataFrame(data0)

df_data0.write.partitionBy("count").mode("overwrite").saveAsTable(table_name)
df_return = self.spark.read.table(table_name)
df_return.show()

data1 = [
{"id": 4, "name":"4", "count": 4},
{"id": 5, "name":"5", "count": 5},
{"id": 6, "name":"6", "count": 6},
]
df_data1 = self.spark.createDataFrame(data1)
df_data1.write.insertInto(table_name)
df_return = self.spark.read.table(table_name)
df_return.show()

data3 = [
{"id": 1, "name":"one", "count":7},
{"id": 2, "name":"two", "count": 8},
{"id": 4, "name":"three", "count": 9},
{"id": 6, "name":"six", "count":10}
]
data3 = self.spark.createDataFrame(data1)
data3.write.insertInto(table_name)
df_return = self.spark.read.table(table_name)
df_return.show()
{code}







  was:
Spark partitionBy is causing some data corruption. I am doing three super simple 
writes. Below is the code to reproduce the problem.


h4. Program output:

{code:title=Program Output|borderStyle=solid}
17/08/10 16:05:03 WARN [SparkUtils]: [Database exists] test
/usr/local/spark/python/pyspark/sql/session.py:331: UserWarning: inferring 
schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"
+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
+---++-+

17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  5|   5|5|
|  6|   6|6|
+---++-+

+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  4|   4|4|
|  5|   5|5|
|  5|   5|5|
|  6|   6|6|
|  6|   6|6|
+---++-+

{code}

In the last show(), I see the data isn't what I would expect.

{code:title=spark init|borderStyle=solid}
self.spark = SparkSession \
.builder \
.master("spark://localhost:7077") \
.enableHiveSupport() \
.getOrCreate()


{code}

{code:title=Code for the test case|borderStyle=solid}
def test_clean_insert_table(self):
table_name = "data"
data0 = [

[jira] [Updated] (SPARK-21698) write.partitionBy() is giving me garbage data

2017-08-10 Thread Luis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis updated SPARK-21698:
-
Description: 
Spark partitionBy is causing some data corruption. I am doing three super simple 
writes. Below is the code to reproduce the problem.


h4. Program output:

{code:title=Program Output|borderStyle=solid}
17/08/10 16:05:03 WARN [SparkUtils]: [Database exists] test
/usr/local/spark/python/pyspark/sql/session.py:331: UserWarning: inferring 
schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"
+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
+---++-+

17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  5|   5|5|
|  6|   6|6|
+---++-+

+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  4|   4|4|
|  5|   5|5|
|  5|   5|5|
|  6|   6|6|
|  6|   6|6|
+---++-+

{code}

In the last show(), I see the data isn't what I would expect.

{code:title=spark init|borderStyle=solid}
self.spark = SparkSession \
.builder \
.master("spark://localhost:7077") \
.enableHiveSupport() \
.getOrCreate()


{code}

{code:title=Code for the test case|borderStyle=solid}
def test_clean_insert_table(self):
table_name = "data"
data0 = [
{"id": 1, "name":"1", "count": 1},
{"id": 2, "name":"2", "count": 2},
{"id": 3, "name":"3", "count": 3},
]
df_data0 = self.spark.createDataFrame(data0)

df_data0.write.partitionBy("count").mode("overwrite").saveAsTable(table_name)
df_return = self.spark.read.table(table_name)
df_return.show()

data1 = [
{"id": 4, "name":"4", "count": 4},
{"id": 5, "name":"5", "count": 5},
{"id": 6, "name":"6", "count": 6},
]
df_data1 = self.spark.createDataFrame(data1)
df_data1.write.insertInto(table_name)
df_return = self.spark.read.table(table_name)
df_return.show()

data3 = [
{"id": 1, "name":"one", "count":7},
{"id": 2, "name":"two", "count": 8},
{"id": 4, "name":"three", "count": 9},
{"id": 6, "name":"six", "count":10}
]
data3 = self.spark.createDataFrame(data1)
data3.write.insertInto(table_name)
df_return = self.spark.read.table(table_name)
df_return.show()
{code}







  was:
Spark partitionBy is causing some data corruption. I am doing three super simple 
writes. Below is the code to reproduce the problem.


h4. Program output:

{code:title=Program Output|borderStyle=solid}
+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  4|   4|4|
|  5|   5|5|
|  5|   5|5|
|  6|   6|6|
|  6|   6|6|
+---++-+

{code}

In the last show(), I see the data isn't what I would expect.

{code:title=spark init|borderStyle=solid}
self.spark = SparkSession \
.builder \
.master("spark://localhost:7077") \
.enableHiveSupport() \
.getOrCreate()


{code}

{code:title=Code for the test case|borderStyle=solid}
def test_clean_insert_table(self):
table_name = "data"
data0 = [
{"id": 1, "name":"1", "count": 1},
{"id": 2, "name":"2", "count": 2},
{"id": 3, "name":"3", "count": 3},
]
df_data0 = self.spark.createDataFrame(data0)

df_data0.write.partitionBy("count").mode("overwrite").saveAsTable(table_name)
df_return = self.spark.read.table(table_name)
df_return.show()

data1 = [
{"id": 4, "name":"4", "count": 4},
{"id": 5, "name":"5", "count": 5},
{"id": 6, "name":"6", "count": 6},
]
df_data1 = self.spark.createDataFrame(data1)
df_data1.write.insertInto(table_name)
df_return = self.spark.read.table(table_name)
df_return.show()

data3 = [
{"id": 1, "name":"one", "count":7},
{"id": 2, "name":"two", "count": 8},
{"id": 4, "name":"three", "count": 9},
{"id": 6, "name":"six", "count":10}
  

[jira] [Updated] (SPARK-21698) write.partitionBy() is giving me garbage data

2017-08-10 Thread Luis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis updated SPARK-21698:
-
Description: 
Spark partitionBy is causing some data corruption. I am doing three super simple 
writes. Below is the code to reproduce the problem.


h4. Program output:

{code:title=Program Output|borderStyle=solid}
+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  4|   4|4|
|  5|   5|5|
|  5|   5|5|
|  6|   6|6|
|  6|   6|6|
+---++-+

{code}

In the last show(), I see the data isn't what I would expect.

{code:title=spark init|borderStyle=solid}
self.spark = SparkSession \
.builder \
.master("spark://localhost:7077") \
.enableHiveSupport() \
.getOrCreate()


{code}

{code:title=Code for the test case|borderStyle=solid}
def test_clean_insert_table(self):
table_name = "data"
data0 = [
{"id": 1, "name":"1", "count": 1},
{"id": 2, "name":"2", "count": 2},
{"id": 3, "name":"3", "count": 3},
]
df_data0 = self.spark.createDataFrame(data0)

df_data0.write.partitionBy("count").mode("overwrite").saveAsTable(table_name)
df_return = self.spark.read.table(table_name)
df_return.show()

data1 = [
{"id": 4, "name":"4", "count": 4},
{"id": 5, "name":"5", "count": 5},
{"id": 6, "name":"6", "count": 6},
]
df_data1 = self.spark.createDataFrame(data1)
df_data1.write.insertInto(table_name)
df_return = self.spark.read.table(table_name)
df_return.show()

data3 = [
{"id": 1, "name":"one", "count":7},
{"id": 2, "name":"two", "count": 8},
{"id": 4, "name":"three", "count": 9},
{"id": 6, "name":"six", "count":10}
]
data3 = self.spark.createDataFrame(data1)
data3.write.insertInto(table_name)
df_return = self.spark.read.table(table_name)
df_return.show()
{code}







  was:
Spark partitionBy is causing some data corruption. I am doing three super simple 
writes. Below is the code to reproduce the problem.


h4. Program output:

{code:title=Program Output|borderStyle=solid}
17/08/10 15:30:41 WARN [SparkUtils]: [Creating/Writing Table] data
[TestSparkUtils]: DEBUG: [ Initial Create --]
+-+---++
|count| id|name|
+-+---++
|1|  1|   1|
|2|  2|   2|
|3|  3|   3|
+-+---++

[TestSparkUtils]: DEBUG: [ Insert No Duplicates --]
17/08/10 15:30:43 WARN [SparkUtils]: [Inserting Into Table] data
17/08/10 15:30:44 WARN log: Updating partition stats fast for: data

[jira] [Updated] (SPARK-21698) write.partitionBy() is giving me garbage data

2017-08-10 Thread Luis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis updated SPARK-21698:
-
Description: 
Spark partitionBy is causing some data corruption. I am doing three super simple 
writes. Below is the code to reproduce the problem.


h4. Program output:

{code:title=Program Output|borderStyle=solid}
17/08/10 15:30:41 WARN [SparkUtils]: [Creating/Writing Table] data
[TestSparkUtils]: DEBUG: [ Initial Create --]
+-+---++
|count| id|name|
+-+---++
|1|  1|   1|
|2|  2|   2|
|3|  3|   3|
+-+---++

[TestSparkUtils]: DEBUG: [ Insert No Duplicates --]
17/08/10 15:30:43 WARN [SparkUtils]: [Inserting Into Table] data
17/08/10 15:30:44 WARN log: Updating partition stats fast for: data
17/08/10 15:30:44 WARN log: Updated size to 545
17/08/10 15:30:44 WARN log: Updating partition stats fast for: data
17/08/10 15:30:44 WARN log: Updated size to 545
17/08/10 15:30:44 WARN log: Updating partition stats fast for: data
17/08/10 15:30:44 WARN log: Updated size to 545
+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  5|   5|5|
|  6|   6|6|
+---++-+

+-+---+-+
|count| id| name|
+-+---+-+
|7|  1|  one|
|8|  2|  two|
|9|  4|three|
|   10|  6|  six|
+-+---+-+

[TestSparkUtils]: DEBUG: [ Update --]
17/08/10 15:30:44 WARN [SparkUtils]: [Inserting Into Table] data
17/08/10 15:30:45 WARN log: Updating partition stats fast for: data
17/08/10 15:30:45 WARN log: Updated size to 1122
+---++-+
| id|name|count|
+---++-+
|  9|   4| null|
| 10|   6| null|
|  7|   1| null|
|  8|   2| null|
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  5|   5|5|
|  6|   6|6|
+---++-+

..
--
Ran 2 tests in 11.559s

OK
{code}

In the last show(), I see the data is corrupted. The data was switched across the 
columns, and I am getting null results. Below are the main snippets of the code I 
am using to generate the problem:

{code:title=spark init|borderStyle=solid}

[jira] [Updated] (SPARK-21698) write.partitionBy() is giving me garbage data

2017-08-10 Thread Luis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis updated SPARK-21698:
-
Description: 
Spark partitionBy is causing some data corruption. I am doing three super simple 
writes. Below is the code to reproduce the problem.


h4. Program output:

{code:title=Program Output|borderStyle=solid}
17/08/10 15:30:41 WARN [SparkUtils]: [Creating/Writing Table] data
[TestSparkUtils]: DEBUG: [ Initial Create --]
+-+---++
|count| id|name|
+-+---++
|1|  1|   1|
|2|  2|   2|
|3|  3|   3|
+-+---++

[TestSparkUtils]: DEBUG: [ Insert No Duplicates --]
17/08/10 15:30:43 WARN [SparkUtils]: [Inserting Into Table] data
17/08/10 15:30:44 WARN log: Updating partition stats fast for: data
17/08/10 15:30:44 WARN log: Updated size to 545
17/08/10 15:30:44 WARN log: Updating partition stats fast for: data
17/08/10 15:30:44 WARN log: Updated size to 545
17/08/10 15:30:44 WARN log: Updating partition stats fast for: data
17/08/10 15:30:44 WARN log: Updated size to 545
+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  5|   5|5|
|  6|   6|6|
+---++-+

+-+---+-+
|count| id| name|
+-+---+-+
|7|  1|  one|
|8|  2|  two|
|9|  4|three|
|   10|  6|  six|
+-+---+-+

[TestSparkUtils]: DEBUG: [ Update --]
17/08/10 15:30:44 WARN [SparkUtils]: [Inserting Into Table] data
17/08/10 15:30:45 WARN log: Updating partition stats fast for: data
17/08/10 15:30:45 WARN log: Updated size to 1122
+---++-+
| id|name|count|
+---++-+
|  9|   4| null|
| 10|   6| null|
|  7|   1| null|
|  8|   2| null|
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  5|   5|5|
|  6|   6|6|
+---++-+

..
--
Ran 2 tests in 11.559s

OK
{code}

In the last show(), I see the data is corrupted. The data was switched across the 
columns, and I am getting null results. Below are the main snippets of the code I 
am using to generate the problem:

{code:title=spark init|borderStyle=solid}

[jira] [Updated] (SPARK-21698) write.partitionBy() is giving me garbage data

2017-08-10 Thread Luis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis updated SPARK-21698:
-
Summary: write.partitionBy() is giving me garbage data  (was: 
write.partitionBy() is given me garbage data)

> write.partitionBy() is giving me garbage data
> -
>
> Key: SPARK-21698
> URL: https://issues.apache.org/jira/browse/SPARK-21698
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
> Environment: Linux Ubuntu 17.04.  Python 3.5.
>Reporter: Luis
>
> Spark partitionBy is causing some data corruption. I am doing three super 
> simple writes. Below is the code to reproduce the problem.
> h4. I'll begin by showing the program output:
> {code:title=Program Output|borderStyle=solid}
> 17/08/10 15:30:41 WARN [SparkUtils]: [Creating/Writing Table] data
> [TestSparkUtils]: DEBUG: [ Initial Create --]
> +-+---++
> |count| id|name|
> +-+---++
> |1|  1|   1|
> |2|  2|   2|
> |3|  3|   3|
> +-+---++
>
> [TestSparkUtils]: DEBUG: [ Insert No Duplicates --]
> 17/08/10 15:30:43 WARN [SparkUtils]: [Inserting Into Table] data
> 17/08/10 15:30:44 WARN log: Updating partition stats fast for: data
> 17/08/10 15:30:44 WARN log: Updated size to 545
> 17/08/10 15:30:44 WARN log: Updating partition stats fast for: data
> 17/08/10 15:30:44 WARN log: Updated size to 545
> 17/08/10 15:30:44 WARN log: Updating partition stats fast for: data
> 17/08/10 15:30:44 WARN log: Updated size to 545
> +---++-+
> | id|name|count|
> +---++-+
> |  1|   1|1|
> |  2|   2|2|
> |  3|   3|3|
> |  4|   4|4|
> |  5|   5|5|
> |  6|   6|6|
> +---++-+
> +-+---+-+
> |count| id| name|
> +-+---+-+
> |7|  1|  one|
> |8|  2|  two|
> |9|  4|three|
> |   10|  6|  six|
> +-+---+-+
> [TestSparkUtils]: DEBUG: [ Update --]
> 17/08/10 15:30:44 WARN [SparkUtils]: [Inserting Into Table] data
> 17/08/10 15:30:45 WARN 

[jira] [Created] (SPARK-21698) write.partitionBy() is given me garbage data

2017-08-10 Thread Luis (JIRA)
Luis created SPARK-21698:


 Summary: write.partitionBy() is given me garbage data
 Key: SPARK-21698
 URL: https://issues.apache.org/jira/browse/SPARK-21698
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0, 2.1.1
 Environment: Linux Ubuntu 17.04.  Python 3.5.
Reporter: Luis


Spark partitionBy is causing some data corruption. I am doing three super simple 
writes. Below is the code to reproduce the problem.


h4. I'll begin by showing the program output:

{code:title=Program Output|borderStyle=solid}
17/08/10 15:30:41 WARN [SparkUtils]: [Creating/Writing Table] data
[TestSparkUtils]: DEBUG: [ Initial Create --]
+-+---++
|count| id|name|
+-+---++
|1|  1|   1|
|2|  2|   2|
|3|  3|   3|
+-+---++

[TestSparkUtils]: DEBUG: [ Insert No Duplicates --]
17/08/10 15:30:43 WARN [SparkUtils]: [Inserting Into Table] data
17/08/10 15:30:44 WARN log: Updating partition stats fast for: data
17/08/10 15:30:44 WARN log: Updated size to 545
17/08/10 15:30:44 WARN log: Updating partition stats fast for: data
17/08/10 15:30:44 WARN log: Updated size to 545
17/08/10 15:30:44 WARN log: Updating partition stats fast for: data
17/08/10 15:30:44 WARN log: Updated size to 545
+---++-+
| id|name|count|
+---++-+
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  5|   5|5|
|  6|   6|6|
+---++-+

+-+---+-+
|count| id| name|
+-+---+-+
|7|  1|  one|
|8|  2|  two|
|9|  4|three|
|   10|  6|  six|
+-+---+-+

[TestSparkUtils]: DEBUG: [ Update --]
17/08/10 15:30:44 WARN [SparkUtils]: [Inserting Into Table] data
17/08/10 15:30:45 WARN log: Updating partition stats fast for: data
17/08/10 15:30:45 WARN log: Updated size to 1122
+---++-+
| id|name|count|
+---++-+
|  9|   4| null|
| 10|   6| null|
|  7|   1| null|
|  8|   2| null|
|  1|   1|1|
|  2|   2|2|
|  3|   3|3|
|  4|   4|4|
|  5|   5|5|
|  6|   6|6|
+---++-+

..
--
Ran 2 tests in 

[jira] [Resolved] (SPARK-21638) Warning message of RF is not accurate

2017-08-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21638.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18868
[https://github.com/apache/spark/pull/18868]

> Warning message of RF is not accurate
> -
>
> Key: SPARK-21638
> URL: https://issues.apache.org/jira/browse/SPARK-21638
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
> Environment: 
>Reporter: Peng Meng
>Priority: Minor
> Fix For: 2.3.0
>
>
> When training an RF model, there are many warning messages like this:
> {quote}WARN RandomForest: Tree learning is using approximately 268492800 
> bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. 
> This allows splitting 2622 nodes in this iteration.{quote}
> This warning message is unnecessary, and the numbers it reports are not accurate.
> Actually, if not all of the nodes can split in one iteration, it will show this 
> warning. In most cases, not all of the nodes can split in a single iteration, so 
> in most cases this warning is shown for every iteration.
> This is because:
> {code:java}
> while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) {
>   val (treeIndex, node) = nodeStack.top
>   // Choose subset of features for node (if subsampling).
>   val featureSubset: Option[Array[Int]] = if 
> (metadata.subsamplingFeatures) {
> Some(SamplingUtils.reservoirSampleAndCount(Range(0,
>   metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
> rng.nextLong())._1)
>   } else {
> None
>   }
>   // Check if enough memory remains to add this node to the group.
>   val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
> featureSubset) * 8L
>   if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
> nodeStack.pop()
> mutableNodesForGroup.getOrElseUpdate(treeIndex, new 
> mutable.ArrayBuffer[LearningNode]()) +=
>   node
> mutableTreeToNodeToIndexInfo
>   .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
> NodeIndexInfo]())(node.id)
>   = new NodeIndexInfo(numNodesInGroup, featureSubset)
>   }
>   numNodesInGroup += 1   // we did not add the node to mutableNodesForGroup, 
> but we still add memUsage here.
>   memUsage += nodeMemUsage
> }
> if (memUsage > maxMemoryUsage) {
>   // If maxMemoryUsage is 0, we should still allow splitting 1 node.
>   logWarning(s"Tree learning is using approximately $memUsage bytes per 
> iteration, which" +
> s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. This 
> allows splitting" +
> s" $numNodesInGroup nodes in this iteration.")
> }
> {code}
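
Editor's note (outside the quoted report): the loop above is Spark's internal node-selection code, and the proposed fix concerns when {{numNodesInGroup}} is incremented. From the user's side, the volume of this warning can usually be reduced by giving the aggregation a larger per-iteration budget via {{maxMemoryInMB}}. A hedged sketch using the public API ({{train_df}} is a placeholder DataFrame with "features"/"label" columns):

{code:title=Sketch: raising maxMemoryInMB to reduce the warning|borderStyle=solid}
from pyspark.ml.classification import RandomForestClassifier

# maxMemoryInMB (default 256) bounds the statistics aggregated per iteration;
# a larger budget lets more nodes split per pass, so the warning fires less often.
rf = RandomForestClassifier(featuresCol="features", labelCol="label",
                            numTrees=100, maxMemoryInMB=1024)
model = rf.fit(train_df)
{code}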



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21638) Warning message of RF is not accurate

2017-08-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21638:
-

Assignee: Peng Meng

> Warning message of RF is not accurate
> -
>
> Key: SPARK-21638
> URL: https://issues.apache.org/jira/browse/SPARK-21638
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
> Environment: 
>Reporter: Peng Meng
>Assignee: Peng Meng
>Priority: Minor
> Fix For: 2.3.0
>
>
> When training an RF model, there are many warning messages like this:
> {quote}WARN RandomForest: Tree learning is using approximately 268492800 
> bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. 
> This allows splitting 2622 nodes in this iteration.{quote}
> This warning message is unnecessary, and the numbers it reports are not accurate.
> Actually, if not all of the nodes can split in one iteration, it will show this 
> warning. In most cases, not all of the nodes can split in a single iteration, so 
> in most cases this warning is shown for every iteration.
> This is because:
> {code:java}
> while (nodeStack.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) {
>   val (treeIndex, node) = nodeStack.top
>   // Choose subset of features for node (if subsampling).
>   val featureSubset: Option[Array[Int]] = if 
> (metadata.subsamplingFeatures) {
> Some(SamplingUtils.reservoirSampleAndCount(Range(0,
>   metadata.numFeatures).iterator, metadata.numFeaturesPerNode, 
> rng.nextLong())._1)
>   } else {
> None
>   }
>   // Check if enough memory remains to add this node to the group.
>   val nodeMemUsage = RandomForest.aggregateSizeForNode(metadata, 
> featureSubset) * 8L
>   if (memUsage + nodeMemUsage <= maxMemoryUsage || memUsage == 0) {
> nodeStack.pop()
> mutableNodesForGroup.getOrElseUpdate(treeIndex, new 
> mutable.ArrayBuffer[LearningNode]()) +=
>   node
> mutableTreeToNodeToIndexInfo
>   .getOrElseUpdate(treeIndex, new mutable.HashMap[Int, 
> NodeIndexInfo]())(node.id)
>   = new NodeIndexInfo(numNodesInGroup, featureSubset)
>   }
>   numNodesInGroup += 1   // we did not add the node to mutableNodesForGroup, 
> but we still add memUsage here.
>   memUsage += nodeMemUsage
> }
> if (memUsage > maxMemoryUsage) {
>   // If maxMemoryUsage is 0, we should still allow splitting 1 node.
>   logWarning(s"Tree learning is using approximately $memUsage bytes per 
> iteration, which" +
> s" exceeds requested limit maxMemoryUsage=$maxMemoryUsage. This 
> allows splitting" +
> s" $numNodesInGroup nodes in this iteration.")
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UDF from HDFS

2017-08-10 Thread Weiqing Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122234#comment-16122234
 ] 

Weiqing Yang commented on SPARK-21697:
--

Thanks for filing this issue!

> NPE & ExceptionInInitializerError trying to load UDF from HDFS
> --
>
> Key: SPARK-21697
> URL: https://issues.apache.org/jira/browse/SPARK-21697
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Spark Client mode, Hadoop 2.6.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Reported on [the 
> PR|https://github.com/apache/spark/pull/17342#issuecomment-321438157] for 
> SPARK-12868: trying to load a UDF from HDFS is triggering an 
> {{ExceptionInInitializerError}}, caused by an NPE which should only happen if 
> the commons-logging {{LOG}} log is null.
> Hypothesis: the commons logging scan for {{commons-logging.properties}} is 
> happening in the classpath with the HDFS JAR; this is triggering a D/L of the 
> JAR, which needs to force in commons-logging, and, as that's not inited yet, 
> NPEs



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UDF from HDFS

2017-08-10 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122195#comment-16122195
 ] 

Steve Loughran edited comment on SPARK-21697 at 8/10/17 7:48 PM:
-

Text of the PR comment & stack

bq. Have u tried it in yarn-client mode? i add this path in v2.1.1 + Hadoop 
2.6.0, when i run "add jar" through SparkSQL CLI , it comes out this error:

{code}
ERROR thriftserver.SparkSQLDriver: Failed in [add jar 
hdfs://SunshineNameNode3:8020/lib/clouddata-common-lib/chardet-0.0.1.jar]
java.lang.ExceptionInInitializerError
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:662)
at 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:889)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:947)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:369)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:341)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2107)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2076)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2052)
at 
org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1274)
at 
org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
at 
org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
at 
org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
at 
org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:632)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:601)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:278)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:225)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:224)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:267)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:601)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:591)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:738)
at org.apache.spark.sql.hive.HiveSessionState.addJar(HiveSessionState.scala:105)
at org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:40)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.Dataset.(Dataset.scala:185)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 

[jira] [Commented] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UDF from HDFS

2017-08-10 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122195#comment-16122195
 ] 

Steve Loughran commented on SPARK-21697:


{code}
Have u tried it in yarn-client mode? i add this path in v2.1.1 + Hadoop 2.6.0, 
when i run "add jar" through SparkSQL CLI , it comes out this error:
ERROR thriftserver.SparkSQLDriver: Failed in [add jar 
hdfs://SunshineNameNode3:8020/lib/clouddata-common-lib/chardet-0.0.1.jar]
java.lang.ExceptionInInitializerError
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:662)
at 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:889)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:947)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:369)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:341)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2107)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2076)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2052)
at 
org.apache.hadoop.hive.ql.session.SessionState.downloadResource(SessionState.java:1274)
at 
org.apache.hadoop.hive.ql.session.SessionState.resolveAndDownload(SessionState.java:1242)
at 
org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1163)
at 
org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
at 
org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:632)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:601)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:278)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:225)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:224)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:267)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:601)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:591)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:738)
at org.apache.spark.sql.hive.HiveSessionState.addJar(HiveSessionState.scala:105)
at org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:40)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.Dataset.(Dataset.scala:185)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:699)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:335)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:247)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
at 

[jira] [Created] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UDF from HDFS

2017-08-10 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-21697:
--

 Summary: NPE & ExceptionInInitializerError trying to load UDF from 
HDFS
 Key: SPARK-21697
 URL: https://issues.apache.org/jira/browse/SPARK-21697
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
 Environment: Spark Client mode, Hadoop 2.6.0
Reporter: Steve Loughran
Priority: Minor


Reported on [the 
PR|https://github.com/apache/spark/pull/17342#issuecomment-321438157] for 
SPARK-12868: trying to load a UDF from HDFS is triggering an 
{{ExceptionInInitializerError}}, caused by an NPE which should only happen if 
the commons-logging {{LOG}} log is null.

Hypothesis: the commons logging scan for {{commons-logging.properties}} is 
happening in the classpath with the HDFS JAR; this is triggering a D/L of the 
JAR, which needs to force in commons-logging, and, as that's not inited yet, 
NPEs
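
Editor's note (not from the ticket): the failing path is triggered by adding a jar with an HDFS URI from inside the session. A hedged sketch of the triggering call and a possible interim workaround (all paths and URIs below are placeholders):

{code:title=Sketch: triggering call and possible workaround|borderStyle=solid}
# Reported failing path: the session downloads the jar from HDFS via Hive's
# SessionState, which is where the ExceptionInInitializerError surfaces.
spark.sql("ADD JAR hdfs://namenode:8020/lib/my-udfs.jar")

# Possible workaround until this is resolved: ship the jar at submit time
# (spark-submit --jars /local/path/my-udfs.jar) or add it from a local path.
spark.sql("ADD JAR /local/path/my-udfs.jar")
{code}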



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21669) Internal API for collecting metrics/stats during FileFormatWriter jobs

2017-08-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-21669.
-
   Resolution: Fixed
 Assignee: Adrian Ionescu
Fix Version/s: 2.3.0

> Internal API for collecting metrics/stats during FileFormatWriter jobs
> --
>
> Key: SPARK-21669
> URL: https://issues.apache.org/jira/browse/SPARK-21669
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Adrian Ionescu
>Assignee: Adrian Ionescu
> Fix For: 2.3.0
>
>
> It would be useful to have some infrastructure in place for collecting custom 
> metrics or statistics on data on the fly, as it is being written to disk.
> This was inspired by the work in SPARK-20703, which added simple metrics 
> collection for data write operations, such as {{numFiles}}, 
> {{numPartitions}}, {{numRows}}. Those metrics are first collected on the 
> executors and then sent to the driver, which aggregates and posts them as 
> updates to the {{SQLMetrics}} subsystem.
> The above can be generalized and turned into a pluggable interface, which in 
> the future could be used for other purposes: e.g. automatic maintenance of 
> cost-based optimizer (CBO) statistics during "INSERT INTO  SELECT ..." 
> operations, such that users won't need to explicitly call "ANALYZE TABLE 
>  COMPUTE STATISTICS" afterwards anymore, thus avoiding an extra 
> full-table scan.
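
To make the proposal concrete, here is a purely hypothetical sketch of the collect-on-executors, aggregate-on-driver pattern the ticket describes. None of these names are the actual internal API (which is Scala code inside {{FileFormatWriter}}); they only illustrate the shape of a pluggable collector:

{code:title=Hypothetical sketch of pluggable write stats|borderStyle=solid}
class WriteTaskStats:
    """Stats gathered on one write task while rows are written out."""
    def __init__(self):
        self.num_rows = 0
        self.num_files = 0

    def on_new_file(self):
        self.num_files += 1

    def on_row(self, row):
        self.num_rows += 1

def aggregate(per_task_stats):
    """Driver-side aggregation, analogous to posting SQLMetrics updates."""
    total = WriteTaskStats()
    for stats in per_task_stats:
        total.num_rows += stats.num_rows
        total.num_files += stats.num_files
    return total
{code}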



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21696) State Store can't handle corrupted snapshots

2017-08-10 Thread Alexander Bessonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122078#comment-16122078
 ] 

Alexander Bessonov commented on SPARK-21696:


{{HDFSBackedStateStoreProvider.doMaintenance()}} will suppress any {{NonFatal}} 
exceptions. {{startMaintenanceIfNeeded()}} wouldn't restart maintenance if it 
crashed. The State Store can still function even when a snapshot file is 
corrupted, by simply falling back to the deltas.

> State Store can't handle corrupted snapshots
> 
>
> Key: SPARK-21696
> URL: https://issues.apache.org/jira/browse/SPARK-21696
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: Alexander Bessonov
>Priority: Critical
>
> State store's asynchronous maintenance task (generation of Snapshot files) is 
> not rescheduled if crashed which might lead to corrupted snapshots.
> In our case, on multiple occasions, executors died during maintenance task 
> with Out Of Memory error which led to following error on recovery:
> {code:none}
> 17/08/07 20:12:24 WARN TaskSetManager: Lost task 3.1 in stage 102.0 (TID 
> 3314, dnj2-bach-r2n10.bloomberg.com, executor 94): java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:436)
> at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:314)
> at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:313)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:313)
> at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:220)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:186)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> 

[jira] [Updated] (SPARK-21696) State Store can't handle corrupted snapshots

2017-08-10 Thread Alexander Bessonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Bessonov updated SPARK-21696:
---
Description: 
The State Store's asynchronous maintenance task (generation of snapshot files) is 
not rescheduled if it crashes, which might lead to corrupted snapshots.

In our case, on multiple occasions, executors died during the maintenance task with 
an Out Of Memory error, which led to the following error on recovery:
{code:none}
17/08/07 20:12:24 WARN TaskSetManager: Lost task 3.1 in stage 102.0 (TID 3314, 
dnj2-bach-r2n10.bloomberg.com, executor 94): java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:436)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:314)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:313)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:313)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:220)
at 
org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:186)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

  was:
The State Store's asynchronous maintenance task (generation of snapshot files) is 
not rescheduled if it crashes, which might lead to corrupted snapshots.

In our case, on multiple occasions, executors died during the maintenance task with 
an Out Of Memory error, which led to the following error on recovery:
{code:text}
17/08/07 20:12:24 WARN TaskSetManager: Lost task 3.1 in stage 102.0 (TID 3314, 
dnj2-bach-r2n10.bloomberg.com, executor 94): java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 

[jira] [Created] (SPARK-21696) State Store can't handle corrupted snapshots

2017-08-10 Thread Alexander Bessonov (JIRA)
Alexander Bessonov created SPARK-21696:
--

 Summary: State Store can't handle corrupted snapshots
 Key: SPARK-21696
 URL: https://issues.apache.org/jira/browse/SPARK-21696
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0, 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0
Reporter: Alexander Bessonov
Priority: Critical


The State Store's asynchronous maintenance task (generation of snapshot files) is 
not rescheduled if it crashes, which might lead to corrupted snapshots.

In our case, on multiple occasions, executors died during the maintenance task with 
an Out Of Memory error, which led to the following error on recovery:
{code:text}
17/08/07 20:12:24 WARN TaskSetManager: Lost task 3.1 in stage 102.0 (TID 3314, 
dnj2-bach-r2n10.bloomberg.com, executor 94): java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:436)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:314)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:313)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:313)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:220)
at 
org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:186)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
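
For readers hitting this: the snapshot files above are produced by a background maintenance task whose cadence and retention are governed by two session configs. Tuning them does not fix the rescheduling bug described here; this is only a sketch of where they live (values are illustrative):

{code:title=Sketch: state store maintenance configs|borderStyle=solid}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("stateful-stream")
         # how often the background maintenance task runs (snapshot generation)
         .config("spark.sql.streaming.stateStore.maintenanceInterval", "60s")
         # how many batches of state files are retained for recovery
         .config("spark.sql.streaming.minBatchesToRetain", 100)
         .getOrCreate())
{code}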



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17419) Mesos virtual network support

2017-08-10 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122019#comment-16122019
 ] 

Susan X. Huynh commented on SPARK-17419:


SPARK-21694 allows the user to pass network labels to CNI plugins.

> Mesos virtual network support
> -
>
> Key: SPARK-17419
> URL: https://issues.apache.org/jira/browse/SPARK-17419
> Project: Spark
>  Issue Type: Task
>  Components: Mesos
>Reporter: Michael Gummelt
>
> http://mesos.apache.org/documentation/latest/cni/
> This will enable launching executors into virtual networks for isolation and 
> security. It will also enable container per IP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17419) Mesos virtual network support

2017-08-10 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122018#comment-16122018
 ] 

Susan X. Huynh commented on SPARK-17419:


SPARK-18232 adds the ability to launch containers attached to a CNI network, by 
specifying `--conf spark.mesos.network.name`.
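
A minimal usage sketch of the configuration referenced above (the master URL and network name are placeholders):

{code:title=Sketch: attaching executors to a named CNI network|borderStyle=solid}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("mesos://zk://master:2181/mesos")
         # SPARK-18232: launch containers attached to the named CNI network
         .config("spark.mesos.network.name", "my-cni-network")
         .getOrCreate())
{code}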

> Mesos virtual network support
> -
>
> Key: SPARK-17419
> URL: https://issues.apache.org/jira/browse/SPARK-17419
> Project: Spark
>  Issue Type: Task
>  Components: Mesos
>Reporter: Michael Gummelt
>
> http://mesos.apache.org/documentation/latest/cni/
> This will enable launching executors into virtual networks for isolation and 
> security. It will also enable container per IP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21695) Spark scheduler locality algorithm can take longer than expected

2017-08-10 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-21695:
-

 Summary: Spark scheduler locality algorithm can take longer than 
expected
 Key: SPARK-21695
 URL: https://issues.apache.org/jira/browse/SPARK-21695
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: Thomas Graves


Reference jira https://issues.apache.org/jira/browse/SPARK-21656

I'm seeing an issue with some jobs where the scheduler takes a long time to 
schedule tasks on executors. The default locality wait is 3 seconds, so I was 
expecting that an executor should get some task on it within at most 9 seconds 
(node local, rack local, any), but it's taking much more time than that. In the 
case of SPARK-21656 it takes 60+ seconds and executors idle timeout.

We should investigate why and see if we can fix this.

Upon an initial look, it seems the scheduler resets the locality lastLaunchTime 
whenever it places any task on a node at that locality level. It appears this 
means it can take far longer than 3 seconds for any particular task to fall 
back, but this needs to be verified.
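
For reference, the 3-second default mentioned above comes from {{spark.locality.wait}}, which can also be overridden per locality level. This is only a sketch of the knobs involved, not a fix for the reset behaviour described here (values are illustrative):

{code:title=Sketch: locality wait settings|borderStyle=solid}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("locality-tuning")
         .config("spark.locality.wait", "3s")          # base wait before falling back a level
         .config("spark.locality.wait.process", "3s")  # process-local -> node-local
         .config("spark.locality.wait.node", "3s")     # node-local -> rack-local
         .config("spark.locality.wait.rack", "3s")     # rack-local -> any
         .getOrCreate())
{code}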



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21694) Support Mesos CNI network labels

2017-08-10 Thread Susan X. Huynh (JIRA)
Susan X. Huynh created SPARK-21694:
--

 Summary: Support Mesos CNI network labels
 Key: SPARK-21694
 URL: https://issues.apache.org/jira/browse/SPARK-21694
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 2.2.0
Reporter: Susan X. Huynh
 Fix For: 2.3.0


Background: SPARK-18232 added the ability to launch containers attached to a 
CNI network by specifying the network name via `spark.mesos.network.name`.

This ticket is to allow the user to pass network labels to CNI plugins. More 
details in the related Mesos documentation: 
http://mesos.apache.org/documentation/latest/cni/#mesos-meta-data-to-cni-plugins
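
A hedged sketch of how this could look from the application side, assuming the configuration key this ticket proposes ends up as {{spark.mesos.network.labels}} (the key, master URL, and label values below are assumptions, not confirmed API):

{code:title=Hypothetical sketch: CNI network labels|borderStyle=solid}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("mesos://zk://master:2181/mesos")
         .config("spark.mesos.network.name", "my-cni-network")
         # assumed key: key:value pairs forwarded to the CNI plugin as labels
         .config("spark.mesos.network.labels", "team:data,zone:east")
         .getOrCreate())
{code}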



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21688:
--
Priority: Minor  (was: Major)

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
>Priority: Minor
> Attachments: ddot unitest.png, mllib svm training.png, 
> native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> In the current MLlib SVM implementation, we found that the CPU is not fully 
> utilized; one reason is that f2j BLAS is hard-coded for the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper 
> settings native BLAS is generally better than f2j at the unit-test level. 
> Here we make the BLAS operations in SVM go through MKL BLAS and get an end-to-end 
> performance report showing that in most cases native BLAS outperforms 
> f2j BLAS by up to 50%.
> So, we suggest removing those hard-coded f2j calls and going with native BLAS if 
> available. If this proposal is acceptable, we will move on to benchmark the other 
> algorithms impacted.
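
Editor's note (not from the ticket): which BLAS MLlib ends up calling is decided by netlib-java at runtime. The ticket's point is that some code paths are hard-wired to f2j regardless, so the setting below does not cover those, but it is the usual, hedged way to request a native (e.g. MKL-backed) BLAS where netlib dispatch is used:

{code:title=Sketch: requesting native BLAS via netlib-java|borderStyle=solid}
from pyspark.sql import SparkSession

native_blas = "-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS"

spark = (SparkSession.builder
         .appName("svm-native-blas")
         .config("spark.driver.extraJavaOptions", native_blas)
         .config("spark.executor.extraJavaOptions", native_blas)
         .getOrCreate())
{code}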



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21644) LocalLimit.maxRows is defined incorrectly

2017-08-10 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121946#comment-16121946
 ] 

Xiao Li commented on SPARK-21644:
-

https://github.com/apache/spark/pull/18851

> LocalLimit.maxRows is defined incorrectly
> -
>
> Key: SPARK-21644
> URL: https://issues.apache.org/jira/browse/SPARK-21644
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> {code}
> case class LocalLimit(limitExpr: Expression, child: LogicalPlan) extends 
> UnaryNode {
>   override def output: Seq[Attribute] = child.output
>   override def maxRows: Option[Long] = {
> limitExpr match {
>   case IntegerLiteral(limit) => Some(limit)
>   case _ => None
> }
>   }
> }
> {code}
> This is simply wrong, since LocalLimit is only about partition level limits.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions

2017-08-10 Thread Chaoyu Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121931#comment-16121931
 ] 

Chaoyu Tang commented on SPARK-14927:
-

[~rajeshc] could you provide your example here?

> DataFrame. saveAsTable creates RDD partitions but not Hive partitions
> -
>
> Key: SPARK-14927
> URL: https://issues.apache.org/jira/browse/SPARK-14927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1
> Environment: Mac OS X 10.11.4 local
>Reporter: Sasha Ovsankin
>
> This is a followup to 
> http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive
>  . I tried to use the suggestions in the answers but couldn't make it work in 
> Spark 1.6.1.
> I am trying to create partitions programmatically from a `DataFrame`. Here is 
> the relevant code (adapted from a Spark test):
> hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
> //hc.setConf("hive.exec.dynamic.partition", "true")
> //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
> hc.sql("create database if not exists tmp")
> hc.sql("drop table if exists tmp.partitiontest1")
> Seq(2012 -> "a").toDF("year", "val")
>   .write
>   .partitionBy("year")
>   .mode(SaveMode.Append)
>   .saveAsTable("tmp.partitiontest1")
> hc.sql("show partitions tmp.partitiontest1").show
> Full file is here: 
> https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
> ==
> HIVE FAILURE OUTPUT
> ==
> SET hive.support.sql11.reserved.keywords=false
> SET hive.metastore.warehouse.dir=tmp/tests
> OK
> OK
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a 
> partitioned table
> ==
> It looks like the root cause is that 
> `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable`
>  always creates table with empty partitions.
> Any help to move this forward is appreciated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18648) spark-shell --jars option does not add jars to classpath on windows

2017-08-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-18648.
--
Resolution: Duplicate

I checked that {{spark-shell --jars C:\test\my.jar}} works and this is fixed. I am 
resolving this.

> spark-shell --jars option does not add jars to classpath on windows
> ---
>
> Key: SPARK-18648
> URL: https://issues.apache.org/jira/browse/SPARK-18648
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Windows
>Affects Versions: 2.0.2
> Environment: Windows 7 x64
>Reporter: Michel Lemay
>  Labels: windows
>
> I can't import symbols from command line jars when in the shell:
> Adding jars via --jars:
> {code}
> spark-shell --master local[*] --jars path\to\deeplearning4j-core-0.7.0.jar
> {code}
> Same result if I add it through maven coordinates:
> {code}spark-shell --master local[*] --packages 
> org.deeplearning4j:deeplearning4j-core:0.7.0
> {code}
> I end up with:
> {code}
> scala> import org.deeplearning4j
> :23: error: object deeplearning4j is not a member of package org
>import org.deeplearning4j
> {code}
> NOTE: It is working as expected when running on linux.
> Sample output with --verbose:
> {code}
> Using properties file: null
> Parsed arguments:
>   master  local[*]
>   deployMode  null
>   executorMemory  null
>   executorCores   null
>   totalExecutorCores  null
>   propertiesFile  null
>   driverMemorynull
>   driverCores null
>   driverExtraClassPathnull
>   driverExtraLibraryPath  null
>   driverExtraJavaOptions  null
>   supervise   false
>   queue   null
>   numExecutorsnull
>   files   null
>   pyFiles null
>   archivesnull
>   mainClass   org.apache.spark.repl.Main
>   primaryResource spark-shell
>   nameSpark shell
>   childArgs   []
>   jars
> file:/C:/Apps/Spark/spark-2.0.2-bin-hadoop2.4/bin/../deeplearning4j-core-0.7.0.jar
>   packagesnull
>   packagesExclusions  null
>   repositoriesnull
>   verbose true
> Spark properties used, including those specified through
>  --conf and those from the properties file null:
> Main class:
> org.apache.spark.repl.Main
> Arguments:
> System properties:
> SPARK_SUBMIT -> true
> spark.app.name -> Spark shell
> spark.jars -> 
> file:/C:/Apps/Spark/spark-2.0.2-bin-hadoop2.4/bin/../deeplearning4j-core-0.7.0.jar
> spark.submit.deployMode -> client
> spark.master -> local[*]
> Classpath elements:
> file:/C:/Apps/Spark/spark-2.0.2-bin-hadoop2.4/bin/../deeplearning4j-core-0.7.0.jar
> 16/11/30 08:30:49 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/11/30 08:30:51 WARN SparkContext: Use an existing SparkContext, some 
> configuration may not take effect.
> Spark context Web UI available at http://192.168.70.164:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1480512651325).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.2
>   /_/
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> import org.deeplearning4j
> :23: error: object deeplearning4j is not a member of package org
>import org.deeplearning4j
>   ^
> scala>
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests

2017-08-10 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121888#comment-16121888
 ] 

Hyukjin Kwon commented on SPARK-21693:
--

Yes, it does build multiple times, and if I have observed this correctly, it 
won't affect queuing particularly, but it'd add roughly 25-30ish mins more for 
each build. I will check out other possible things too, and also try to check 
the time of each test in "MLlib classification algorithms".

> AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
> -
>
> Key: SPARK-21693
> URL: https://issues.apache.org/jira/browse/SPARK-21693
> Project: Spark
>  Issue Type: Test
>  Components: Build, SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> We finally sometimes reach the time limit, 1.5 hours, 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
> I requested to increase this from an hour to 1.5 hours before but it looks we 
> should fix this in AppVeyor. I asked this for my account few times before but 
> it looks we can't increase this time limit again and again.
> I could identify two things that look taking a quite a bit of time:
> 1. Disabled cache feature in pull request builder, which ends up downloading 
> Maven dependencies (10-20ish mins)
> https://www.appveyor.com/docs/build-cache/
> {quote}
> Note: Saving cache is disabled in Pull Request builds.
> {quote}
> and also see 
> http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working
> This seems difficult to fix within Spark.
> 2. "MLlib classification algorithms" tests (30-35ish mins)
> This test below looks taking 30-35ish mins.
> {code}
> MLlib classification algorithms, except for tree-based algorithms: Spark 
> package found in SPARK_HOME: C:\projects\spark\bin\..
> ..
> {code}
> As a (I think) last resort, we could make a matrix for this test alone, so 
> that we run the other tests after a build and then run this test after 
> another build, for example, I run Scala tests by this workaround - 
> https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix 
> with 7 build and test each).
> I am also checking and testing other ways.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests

2017-08-10 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121866#comment-16121866
 ] 

Felix Cheung commented on SPARK-21693:
--

Splitting the test matrix is also possible. I worry, though, that since caching is 
disabled, the Spark jar would be built multiple times. My main concerns are how 
long the tests will run and whether that will lengthen the queuing of test runs 
(the queue can already get quite long, and people sometimes ignore pending 
AppVeyor runs).

> AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
> -
>
> Key: SPARK-21693
> URL: https://issues.apache.org/jira/browse/SPARK-21693
> Project: Spark
>  Issue Type: Test
>  Components: Build, SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> We finally sometimes reach the time limit, 1.5 hours, 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
> I requested to increase this from an hour to 1.5 hours before but it looks we 
> should fix this in AppVeyor. I asked this for my account few times before but 
> it looks we can't increase this time limit again and again.
> I could identify two things that look taking a quite a bit of time:
> 1. Disabled cache feature in pull request builder, which ends up downloading 
> Maven dependencies (10-20ish mins)
> https://www.appveyor.com/docs/build-cache/
> {quote}
> Note: Saving cache is disabled in Pull Request builds.
> {quote}
> and also see 
> http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working
> This seems difficult to fix within Spark.
> 2. "MLlib classification algorithms" tests (30-35ish mins)
> This test below looks taking 30-35ish mins.
> {code}
> MLlib classification algorithms, except for tree-based algorithms: Spark 
> package found in SPARK_HOME: C:\projects\spark\bin\..
> ..
> {code}
> As a (I think) last resort, we could make a matrix for this test alone, so 
> that we run the other tests after a build and then run this test after 
> another build, for example, I run Scala tests by this workaround - 
> https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix 
> with 7 build and test each).
> I am also checking and testing other ways.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests

2017-08-10 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121860#comment-16121860
 ] 

Felix Cheung commented on SPARK-21693:
--

We could certainly simplify the classification set - but there's a fair number 
of APIs being tested in there; perhaps we could time them to see which ones are 
taking the most time.

> AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
> -
>
> Key: SPARK-21693
> URL: https://issues.apache.org/jira/browse/SPARK-21693
> Project: Spark
>  Issue Type: Test
>  Components: Build, SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> We finally sometimes reach the time limit, 1.5 hours, 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
> I requested to increase this from an hour to 1.5 hours before but it looks we 
> should fix this in AppVeyor. I asked this for my account few times before but 
> it looks we can't increase this time limit again and again.
> I could identify two things that look taking a quite a bit of time:
> 1. Disabled cache feature in pull request builder, which ends up downloading 
> Maven dependencies (10-20ish mins)
> https://www.appveyor.com/docs/build-cache/
> {quote}
> Note: Saving cache is disabled in Pull Request builds.
> {quote}
> and also see 
> http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working
> This seems difficult to fix within Spark.
> 2. "MLlib classification algorithms" tests (30-35ish mins)
> This test below looks taking 30-35ish mins.
> {code}
> MLlib classification algorithms, except for tree-based algorithms: Spark 
> package found in SPARK_HOME: C:\projects\spark\bin\..
> ..
> {code}
> As a (I think) last resort, we could make a matrix for this test alone, so 
> that we run the other tests after a build and then run this test after 
> another build, for example, I run Scala tests by this workaround - 
> https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix 
> with 7 build and test each).
> I am also checking and testing other ways.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21535) Reduce memory requirement for CrossValidator and TrainValidationSplit

2017-08-10 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang resolved SPARK-21535.

Resolution: Not A Problem

The new implementation would load the evaluation dataset during model training and 
may not always deliver better performance. Please refer to the discussion in the 
PR.

> Reduce memory requirement for CrossValidator and TrainValidationSplit 
> --
>
> Key: SPARK-21535
> URL: https://issues.apache.org/jira/browse/SPARK-21535
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> CrossValidator and TrainValidationSplit both use 
> {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where 
> epm is Array[ParamMap].
> Even though the training process is sequential, current implementation 
> consumes extra driver memory for holding the trained models, which is not 
> necessary and often leads to memory exception for both CrossValidator and 
> TrainValidationSplit. My proposal is to optimize the training implementation, 
> thus that used model can be collected by GC, and avoid the unnecessary OOM 
> exceptions.
> E.g. when grid search space is 12, old implementation needs to hold all 12 
> trained models in the driver memory at the same time, while the new 
> implementation only needs to hold 1 trained model at a time, and previous 
> model can be cleared by GC.
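
For illustration only, a minimal Scala sketch of the sequential-fitting idea in the 
description above. It is not the actual patch; the concrete 
LogisticRegression/BinaryClassificationEvaluator pair and the helper name 
fitSequentially are assumptions made just to keep the sketch self-contained.

{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset

// Sketch only: fit one ParamMap at a time and keep just the metric, so at most
// one trained model is referenced by the driver at any point.
def fitSequentially(
    est: LogisticRegression,
    evaluator: BinaryClassificationEvaluator,
    epm: Array[ParamMap],
    training: Dataset[_],
    validation: Dataset[_]): Array[Double] = {
  epm.map { paramMap =>
    val model = est.fit(training, paramMap)   // only this model is live
    evaluator.evaluate(model.transform(validation))
    // `model` becomes unreachable after this iteration, so GC can reclaim it
  }
}
{code}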



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests

2017-08-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21693:
-
Description: 
We finally sometimes reach the time limit, 1.5 hours, 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
I requested to increase this from an hour to 1.5 hours before but it looks we 
should fix this in AppVeyor. I asked this for my account few times before but 
it looks we can't increase this time limit again and again.

I could identify two things that look taking a quite a bit of time:


1. Disabled cache feature in pull request builder, which ends up downloading 
Maven dependencies (10-20ish mins)


https://www.appveyor.com/docs/build-cache/

{quote}
Note: Saving cache is disabled in Pull Request builds.
{quote}

and also see 
http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working

This seems difficult to fix within Spark.


2. "MLlib classification algorithms" tests (30-35ish mins)

This test below looks taking 30-35ish mins.

{code}
MLlib classification algorithms, except for tree-based algorithms: Spark 
package found in SPARK_HOME: C:\projects\spark\bin\..
..
{code}

As a (I think) last resort, we could make a matrix for this test alone, so that 
we run the other tests after a build and then run this test after another 
build, for example, I run Scala tests by this workaround - 
https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix 
with 7 build and test each).

I am also checking and testing other ways.


  was:
We finally sometimes reach the time limit, 1.5 hours, 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
I requested to increase this from an hour to 1.5 hours before but it looks we 
should fix this in AppVeyor. I asked this for my account few times before but 
it looks we can't increase this time limit again and again.

I could identify three things that look taking a quite a bit of time:


1. Disabled cache feature in pull request builder, which ends up downloading 
Maven dependencies (10-20ish mins)


https://www.appveyor.com/docs/build-cache/

{quote}
Note: Saving cache is disabled in Pull Request builds.
{quote}

and also see 
http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working

This seems difficult to fix within Spark.


2. "MLlib classification algorithms" tests (30-35ish mins)

This test below looks taking 30-35ish mins.

{code}
MLlib classification algorithms, except for tree-based algorithms: Spark 
package found in SPARK_HOME: C:\projects\spark\bin\..
..
{code}

As a (I think) last resort, we could make a matrix for this test alone, so that 
we run the other tests after a build and then run this test after another 
build, for example, I run Scala tests by this workaround - 
https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix 
with 7 build and test each).


3. Disabled {{spark.sparkr.use.daemon}} on Windows due to the limitation of 
{{mcfork}}

See [this 
codes|https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L362-L392].
 We disabled this feature and currently fork processes from Java that is 
expensive. I haven't tested this yet but maybe reducing 
{{spark.sql.shuffle.partitions}} can be an approach to work around this. 
Currently, if I understood correctly, this is 200 by default in R tests, which 
ends up with 200 Java processes for every shuffle.




> AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
> -
>
> Key: SPARK-21693
> URL: https://issues.apache.org/jira/browse/SPARK-21693
> Project: Spark
>  Issue Type: Test
>  Components: Build, SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> We finally sometimes reach the time limit, 1.5 hours, 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
> I requested to increase this from an hour to 1.5 hours before but it looks we 
> should fix this in AppVeyor. I asked this for my account few times before but 
> it looks we can't increase this time limit again and again.
> I could identify two things that look taking a quite a bit of time:
> 1. Disabled cache feature in pull request builder, which ends up downloading 
> Maven dependencies (10-20ish mins)
> https://www.appveyor.com/docs/build-cache/
> {quote}
> Note: Saving cache is disabled in Pull Request builds.
> {quote}
> and also see 
> http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working
> This seems difficult to fix within Spark.
> 2. 

[jira] [Updated] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests

2017-08-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21693:
-
Description: 
We finally sometimes reach the time limit, 1.5 hours, 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
I requested to increase this from an hour to 1.5 hours before but it looks we 
should fix this in AppVeyor. I asked this for my account few times before but 
it looks we can't increase this time limit again and again.

I could identify three things that look taking a quite a bit of time:


1. Disabled cache feature in pull request builder, which ends up downloading 
Maven dependencies (10-20ish mins)


https://www.appveyor.com/docs/build-cache/

{quote}
Note: Saving cache is disabled in Pull Request builds.
{quote}

and also see 
http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working

This seems difficult to fix within Spark.


2. "MLlib classification algorithms" tests (30-35ish mins)

This test below looks taking 30-35ish mins.

{code}
MLlib classification algorithms, except for tree-based algorithms: Spark 
package found in SPARK_HOME: C:\projects\spark\bin\..
..
{code}

As a (I think) last resort, we could make a matrix for this test alone, so that 
we run the other tests after a build and then run this test after another 
build, for example, I run Scala tests by this workaround - 
https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix 
with 7 build and test each).


3. Disabled {{spark.sparkr.use.daemon}} on Windows due to the limitation of 
{{mcfork}}

See [this 
codes|https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L362-L392].
 We disabled this feature and currently fork processes from Java that is 
expensive. I haven't tested this yet but maybe reducing 
{{spark.sql.shuffle.partitions}} can be an approach to work around this. 
Currently, if I understood correctly, this is 200 by default in R tests, which 
ends up with 200 Java processes for every shuffle.



  was:
We finally sometimes reach the time limit, 1.5 hours, 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
I requested to increase this from an hour to 1.5 hours before but it looks we 
should fix this in AppVeyor. I asked this for my account few times before but 
it looks we can't increase this time limit again and again.

I could identify three things that look taking a quite a bit of times:


1. Disabled cache feature in pull request builder, which ends up downloading 
Maven dependencies (10-20ish mins)


https://www.appveyor.com/docs/build-cache/

{quote}
Note: Saving cache is disabled in Pull Request builds.
{quote}

and also see 
http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working

This seems difficult to fix within Spark.


2. "MLlib classification algorithms" tests (30-35ish mins)

This test below looks taking 30-35ish mins.

{code}
MLlib classification algorithms, except for tree-based algorithms: Spark 
package found in SPARK_HOME: C:\projects\spark\bin\..
..
{code}

As a (I think) last resort, we could make a matrix for this test alone, so that 
we run the other tests after a build and then run this test after another 
build, for example, I run Scala tests by this workaround - 
https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix 
with 7 build and test each).


3. Disabled {{spark.sparkr.use.daemon}} on Windows due to the limitation of 
{{mcfork}}

See [this 
codes|https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L362-L392].
 We disabled this feature and currently fork processes from Java that is 
expensive. I haven't tested this yet but maybe reducing 
{{spark.sql.shuffle.partitions}} can be an approach to work around this. 
Currently, if I understood correctly, this is 200 by default in R tests, which 
ends up with 200 Java processes for every shuffle.




> AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
> -
>
> Key: SPARK-21693
> URL: https://issues.apache.org/jira/browse/SPARK-21693
> Project: Spark
>  Issue Type: Test
>  Components: Build, SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> We finally sometimes reach the time limit, 1.5 hours, 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
> I requested to increase this from an hour to 1.5 hours before but it looks we 
> should fix this in AppVeyor. I asked this for my 

[jira] [Updated] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests

2017-08-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21693:
-
Description: 
We finally sometimes reach the time limit, 1.5 hours, 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
I requested to increase this from an hour to 1.5 hours before but it looks we 
should fix this in AppVeyor. I asked this for my account few times before but 
it looks we can't increase this time limit again and again.

I could identify three things that look taking a quite a bit of times:


1. Disabled cache feature in pull request builder, which ends up downloading 
Maven dependencies (10-20ish mins)


https://www.appveyor.com/docs/build-cache/

{quote}
Note: Saving cache is disabled in Pull Request builds.
{quote}

and also see 
http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working

This seems difficult to fix within Spark.


2. "MLlib classification algorithms" tests (30-35ish mins)

This test below looks taking 30-35ish mins.

{code}
MLlib classification algorithms, except for tree-based algorithms: Spark 
package found in SPARK_HOME: C:\projects\spark\bin\..
..
{code}

As a (I think) last resort, we could make a matrix for this test alone, so that 
we run the other tests after a build and then run this test after another 
build, for example, I run Scala tests by this workaround - 
https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix 
with 7 build and test each).


3. Disabled {{spark.sparkr.use.daemon}} on Windows due to the limitation of 
{{mcfork}}

See [this 
codes|https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L362-L392].
 We disabled this feature and currently fork processes from Java that is 
expensive. I haven't tested this yet but maybe reducing 
{{spark.sql.shuffle.partitions}} can be an approach to work around this. 
Currently, if I understood correctly, this is 200 by default in R tests, which 
ends up with 200 Java processes for every shuffle.



  was:
We finally sometimes reach the time limit, 1.5 hours, 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
I requested to increase this from an hour to 1.5 hours before but it looks we 
should fix this in AppVeyor. I asked this for my account few times before but 
it looks we can't increase this time limit again and again.

I could identify three things that take a quite a bit of times:


1. Disabled cache feature in pull request builder, which ends up downloading 
Maven dependencies (15-20ish mins)


https://www.appveyor.com/docs/build-cache/

{quote}
Note: Saving cache is disabled in Pull Request builds.
{quote}

and also see 
http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working

This seems difficult to fix within Spark.


2. "MLlib classification algorithms" tests (30-35ish mins)

This test below looks taking 30-35ish mins.

{code}
MLlib classification algorithms, except for tree-based algorithms: Spark 
package found in SPARK_HOME: C:\projects\spark\bin\..
..
{code}

As a (I think) last resort, we could make a matrix for this test alone, so that 
we run the other tests after a build and then run this test after another 
build, for example, I run Scala tests by this workaround - 
https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix 
with 7 build and test each).


3. Disabled {{spark.sparkr.use.daemon}} on Windows due to the limitation of 
{{mcfork}}

See [this 
codes|https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L362-L392].
 We disabled this feature and currently fork processes from Java that is 
expensive. I haven't tested this yet but maybe reducing 
{{spark.sql.shuffle.partitions}} can be an approach to work around this. 
Currently, if I understood correctly, this is 200 by default in R tests, which 
ends up with 200 Java processes for every shuffle.




> AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
> -
>
> Key: SPARK-21693
> URL: https://issues.apache.org/jira/browse/SPARK-21693
> Project: Spark
>  Issue Type: Test
>  Components: Build, SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> We finally sometimes reach the time limit, 1.5 hours, 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
> I requested to increase this from an hour to 1.5 hours before but it looks we 
> should fix this in AppVeyor. I asked this for my account few 

[jira] [Commented] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests

2017-08-10 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121827#comment-16121827
 ] 

Hyukjin Kwon commented on SPARK-21693:
--

FYI, [~felixcheung] and [~shivaram].

> AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
> -
>
> Key: SPARK-21693
> URL: https://issues.apache.org/jira/browse/SPARK-21693
> Project: Spark
>  Issue Type: Test
>  Components: Build, SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> We finally sometimes reach the time limit, 1.5 hours, 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
> I requested to increase this from an hour to 1.5 hours before but it looks we 
> should fix this in AppVeyor. I asked this for my account few times before but 
> it looks we can't increase this time limit again and again.
> I could identify three things that take a quite a bit of times:
> 1. Disabled cache feature in pull request builder, which ends up downloading 
> Maven dependencies (15-20ish mins)
> https://www.appveyor.com/docs/build-cache/
> {quote}
> Note: Saving cache is disabled in Pull Request builds.
> {quote}
> and also see 
> http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working
> This seems difficult to fix within Spark.
> 2. "MLlib classification algorithms" tests (30-35ish mins)
> This test below looks taking 30-35ish mins.
> {code}
> MLlib classification algorithms, except for tree-based algorithms: Spark 
> package found in SPARK_HOME: C:\projects\spark\bin\..
> ..
> {code}
> As a (I think) last resort, we could make a matrix for this test alone, so 
> that we run the other tests after a build and then run this test after 
> another build, for example, I run Scala tests by this workaround - 
> https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix 
> with 7 build and test each).
> 3. Disabled {{spark.sparkr.use.daemon}} on Windows due to the limitation of 
> {{mcfork}}
> See [this 
> codes|https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L362-L392].
>  We disabled this feature and currently fork processes from Java that is 
> expensive. I haven't tested this yet but maybe reducing 
> {{spark.sql.shuffle.partitions}} can be an approach to work around this. 
> Currently, if I understood correctly, this is 200 by default in R tests, 
> which ends up with 200 Java processes for every shuffle.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests

2017-08-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21693:
-
Description: 
We finally sometimes reach the time limit, 1.5 hours, 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
I requested to increase this from an hour to 1.5 hours before but it looks we 
should fix this in AppVeyor. I asked this for my account few times before but 
it looks we can't increase this time limit again and again.

I could identify three things that take a quite a bit of times:


1. Disabled cache feature in pull request builder, which ends up downloading 
Maven dependencies (15-20ish mins)


https://www.appveyor.com/docs/build-cache/

{quote}
Note: Saving cache is disabled in Pull Request builds.
{quote}

and also see 
http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working

This seems difficult to fix within Spark.


2. "MLlib classification algorithms" tests (30-35ish mins)

This test below looks taking 30-35ish mins.

{code}
MLlib classification algorithms, except for tree-based algorithms: Spark 
package found in SPARK_HOME: C:\projects\spark\bin\..
..
{code}

As a (I think) last resort, we could make a matrix for this test alone, so that 
we run the other tests after a build and then run this test after another 
build, for example, I run Scala tests by this workaround - 
https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix 
with 7 build and test each).


3. Disabled {{spark.sparkr.use.daemon}} on Windows due to the limitation of 
{{mcfork}}

See [this 
codes|https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L362-L392].
 We disabled this feature and currently fork processes from Java that is 
expensive. I haven't tested this yet but maybe reducing 
{{spark.sql.shuffle.partitions}} can be an approach to work around this. 
Currently, if I understood correctly, this is 200 by default in R tests, which 
ends up with 200 Java processes for every shuffle.



  was:
We finally sometimes reach the time limit, 1.5 hours, 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
I requested to increase this from an hour to 1.5 hours before but it looks we 
should fix this in AppVeyor. I asked this for my account few times before but 
it looks we can't increase this time limit again and again.

I could identify three things that take a quite a bit of times:


1. Disabled cache feature in pull request builder, which ends up downloading 
Maven dependencies (15-20ish mins)


https://www.appveyor.com/docs/build-cache/

{code}
Note: Saving cache is disabled in Pull Request builds.
{code}

and also see 
http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working

This seems difficult to fix within Spark.


2. "MLlib classification algorithms" tests (30-35ish mins)

This test below looks taking 30-35ish mins.

{code}
MLlib classification algorithms, except for tree-based algorithms: Spark 
package found in SPARK_HOME: C:\projects\spark\bin\..
..
{code}

As a (I think) last resort, we could make a matrix for this test alone, so that 
we run the other tests after a build and then run this test after another 
build, for example, I run Scala tests by this workaround - 
https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix 
with 7 build and test each).


3. Disabled {{spark.sparkr.use.daemon}} on Windows due to the limitation of 
{{mcfork}}

See [this 
codes|https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L362-L392].
 We disabled this feature and currently fork processes from Java that is 
expensive. I haven't tested this yet but maybe reducing 
{{spark.sql.shuffle.partitions}} can be an approach to work around this. 
Currently, if I understood correctly, this is 200 by default in R tests, which 
ends up with 200 Java processes for every shuffle.




> AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests
> -
>
> Key: SPARK-21693
> URL: https://issues.apache.org/jira/browse/SPARK-21693
> Project: Spark
>  Issue Type: Test
>  Components: Build, SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> We finally sometimes reach the time limit, 1.5 hours, 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
> I requested to increase this from an hour to 1.5 hours before but it looks we 
> should fix this in AppVeyor. I asked this for my account few times 

[jira] [Created] (SPARK-21693) AppVeyor tests reach the time limit, 1.5 hours, sometimes in SparkR tests

2017-08-10 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-21693:


 Summary: AppVeyor tests reach the time limit, 1.5 hours, sometimes 
in SparkR tests
 Key: SPARK-21693
 URL: https://issues.apache.org/jira/browse/SPARK-21693
 Project: Spark
  Issue Type: Test
  Components: Build, SparkR
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon


We finally sometimes reach the time limit, 1.5 hours, 
https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1676-master
I requested to increase this from an hour to 1.5 hours before but it looks we 
should fix this in AppVeyor. I asked this for my account few times before but 
it looks we can't increase this time limit again and again.

I could identify three things that take a quite a bit of times:


1. Disabled cache feature in pull request builder, which ends up downloading 
Maven dependencies (15-20ish mins)


https://www.appveyor.com/docs/build-cache/

{code}
Note: Saving cache is disabled in Pull Request builds.
{code}

and also see 
http://help.appveyor.com/discussions/problems/4159-cache-doesnt-seem-to-be-working

This seems difficult to fix within Spark.


2. "MLlib classification algorithms" tests (30-35ish mins)

This test below looks taking 30-35ish mins.

{code}
MLlib classification algorithms, except for tree-based algorithms: Spark 
package found in SPARK_HOME: C:\projects\spark\bin\..
..
{code}

As a (I think) last resort, we could make a matrix for this test alone, so that 
we run the other tests after a build and then run this test after another 
build, for example, I run Scala tests by this workaround - 
https://ci.appveyor.com/project/spark-test/spark/build/757-20170716 (a matrix 
with 7 build and test each).


3. Disabled {{spark.sparkr.use.daemon}} on Windows due to the limitation of 
{{mcfork}}

See [this 
codes|https://github.com/apache/spark/blob/478fbc866fbfdb4439788583281863ecea14e8af/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L362-L392].
 We disabled this feature and currently fork processes from Java that is 
expensive. I haven't tested this yet but maybe reducing 
{{spark.sql.shuffle.partitions}} can be an approach to work around this. 
Currently, if I understood correctly, this is 200 by default in R tests, which 
ends up with 200 Java processes for every shuffle.
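
As a rough illustration of the {{spark.sql.shuffle.partitions}} idea in point 3, a 
minimal sketch shown in Scala (the SparkR tests would set the same property via 
{{sparkR.session}}); this is an assumption about a possible mitigation, not a 
verified fix, and the app name is made up.

{code}
import org.apache.spark.sql.SparkSession

// Sketch: lower the shuffle partition count so that, with
// spark.sparkr.use.daemon disabled on Windows, far fewer worker processes are
// forked per shuffle than the default of 200.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sparkr-test-sketch")
  .config("spark.sql.shuffle.partitions", "10")
  .getOrCreate()
{code}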





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18648) spark-shell --jars option does not add jars to classpath on windows

2017-08-10 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121817#comment-16121817
 ] 

Devaraj K commented on SPARK-18648:
---

[~FlamingMike], this has been fixed as part of SPARK-21339; could you check this 
issue with the SPARK-21339 change if you have a chance? Thanks

> spark-shell --jars option does not add jars to classpath on windows
> ---
>
> Key: SPARK-18648
> URL: https://issues.apache.org/jira/browse/SPARK-18648
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Windows
>Affects Versions: 2.0.2
> Environment: Windows 7 x64
>Reporter: Michel Lemay
>  Labels: windows
>
> I can't import symbols from command line jars when in the shell:
> Adding jars via --jars:
> {code}
> spark-shell --master local[*] --jars path\to\deeplearning4j-core-0.7.0.jar
> {code}
> Same result if I add it through maven coordinates:
> {code}spark-shell --master local[*] --packages 
> org.deeplearning4j:deeplearning4j-core:0.7.0
> {code}
> I end up with:
> {code}
> scala> import org.deeplearning4j
> :23: error: object deeplearning4j is not a member of package org
>import org.deeplearning4j
> {code}
> NOTE: It is working as expected when running on linux.
> Sample output with --verbose:
> {code}
> Using properties file: null
> Parsed arguments:
>   master  local[*]
>   deployMode  null
>   executorMemory  null
>   executorCores   null
>   totalExecutorCores  null
>   propertiesFile  null
>   driverMemorynull
>   driverCores null
>   driverExtraClassPathnull
>   driverExtraLibraryPath  null
>   driverExtraJavaOptions  null
>   supervise   false
>   queue   null
>   numExecutorsnull
>   files   null
>   pyFiles null
>   archivesnull
>   mainClass   org.apache.spark.repl.Main
>   primaryResource spark-shell
>   nameSpark shell
>   childArgs   []
>   jars
> file:/C:/Apps/Spark/spark-2.0.2-bin-hadoop2.4/bin/../deeplearning4j-core-0.7.0.jar
>   packagesnull
>   packagesExclusions  null
>   repositoriesnull
>   verbose true
> Spark properties used, including those specified through
>  --conf and those from the properties file null:
> Main class:
> org.apache.spark.repl.Main
> Arguments:
> System properties:
> SPARK_SUBMIT -> true
> spark.app.name -> Spark shell
> spark.jars -> 
> file:/C:/Apps/Spark/spark-2.0.2-bin-hadoop2.4/bin/../deeplearning4j-core-0.7.0.jar
> spark.submit.deployMode -> client
> spark.master -> local[*]
> Classpath elements:
> file:/C:/Apps/Spark/spark-2.0.2-bin-hadoop2.4/bin/../deeplearning4j-core-0.7.0.jar
> 16/11/30 08:30:49 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/11/30 08:30:51 WARN SparkContext: Use an existing SparkContext, some 
> configuration may not take effect.
> Spark context Web UI available at http://192.168.70.164:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1480512651325).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.2
>   /_/
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> import org.deeplearning4j
> :23: error: object deeplearning4j is not a member of package org
>import org.deeplearning4j
>   ^
> scala>
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-08-10 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121801#comment-16121801
 ] 

Ruslan Dautkhanov commented on SPARK-21657:
---

[~bjornjons] confirms this problem pertains to Spark 2.2 too.


> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: cache, caching, collections, nested_types, performance, 
> pyspark, sparksql, sql
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sizes nested collection 
> (0.5m).
> On a recent Xeon processors.
> See attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
> table_name).cache()
> print sqlc.count()
> {code}
> This script generate a number of tables, with the same total number of 
> records across all nested collection (see `scaling` variable in loops). 
> `scaling` variable scales up how many nested elements in each record, but by 
> the same factor scales down number of records in the table. So total number 
> of records stays the same.
> Time grows exponentially (notice log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At scaling 50,000 it took 7 hours to explode the nested collections (\!) of 
> 8k records.
> After 1000 elements in nested collection, time grows exponentially.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21692) Modify PythonUDF to support nullability

2017-08-10 Thread Michael Styles (JIRA)
Michael Styles created SPARK-21692:
--

 Summary: Modify PythonUDF to support nullability
 Key: SPARK-21692
 URL: https://issues.apache.org/jira/browse/SPARK-21692
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.2.0
Reporter: Michael Styles


When creating or registering Python UDFs, a user may know whether null values 
can be returned by the function. PythonUDF and related classes should be 
modified to support nullability.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21677) json_tuple throws NullPointException when column is null as string type.

2017-08-10 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121740#comment-16121740
 ] 

Liang-Chi Hsieh commented on SPARK-21677:
-

As a given field name of {{null}} can't be matched with any field name in the 
JSON, we just output {{null}} as its column value. I think that's reasonable.

> json_tuple throws NullPointException when column is null as string type.
> 
>
> Key: SPARK-21677
> URL: https://issues.apache.org/jira/browse/SPARK-21677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: Starter
>
> I was testing {{json_tuple}} before using this to extract values from JSONs 
> in my testing cluster but I found it could actually throw  
> {{NullPointException}} as below sometimes:
> {code}
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Hyukjin'))").show()
> +---+
> | c0|
> +---+
> |224|
> +---+
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Jackson'))").show()
> ++
> |  c0|
> ++
> |null|
> ++
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show()
> ...
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:367)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:366)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames$lzycompute(jsonExpressions.scala:366)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames(jsonExpressions.scala:365)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields$lzycompute(jsonExpressions.scala:373)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields(jsonExpressions.scala:373)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.org$apache$spark$sql$catalyst$expressions$JsonTuple$$parseRow(jsonExpressions.scala:417)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:401)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:400)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2559)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.eval(jsonExpressions.scala:400)
> {code}
> It sounds we should show explicit error messages or return {{NULL}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121679#comment-16121679
 ] 

Sean Owen commented on SPARK-21688:
---

Not a good solution? How about just checking the env variables? Simple and 
better than nothing
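
A minimal Scala sketch of that kind of environment-variable guard, purely as an 
assumption about what the check could look like; the variable name 
{{SPARK_MLLIB_USE_NATIVE_BLAS}} is made up for illustration and is not an existing 
Spark setting.

{code}
// Sketch: a hypothetical opt-in flag; fall back to f2j unless the user
// explicitly asks for native BLAS through the environment.
val useNativeBlas: Boolean =
  sys.env.get("SPARK_MLLIB_USE_NATIVE_BLAS").exists(_.equalsIgnoreCase("true"))

val blasImpl = if (useNativeBlas) "native (netlib-java system BLAS)" else "f2j"
println(s"HingeGradient would dispatch dot/axpy calls to: $blasImpl")
{code}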

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: ddot unitest.png, mllib svm training.png, 
> native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> in current mllib SVM implementation, we found that the CPU is not fully 
> utilized, one reason is that f2j blas is set to be used in the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305) that with proper 
> settings, native blas is generally better than f2j on the uni-test level, 
> here we make the blas operations in SVM go with MKL blas and get an end to 
> end performance report showing that in most cases native blas outperformance 
> f2j blas up to 50%.
> So, we suggest removing those f2j-fixed calling and going for native blas if 
> available. If this proposal is acceptable, we will move on to benchmark other 
> algorithms impacted. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21677) json_tuple throws NullPointException when column is null as string type.

2017-08-10 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121658#comment-16121658
 ] 

Hyukjin Kwon edited comment on SPARK-21677 at 8/10/17 2:12 PM:
---

[~cjm], I was thinking like

{code}
spark.sql("""SELECT json_tuple('{"a":1, "b":2}', cast(NULL AS STRING), 'b', 
cast(NULL AS STRING), 'a')""").show()

++---++---+
|  c0| c1|  c2| c3|
++---++---+
|null|  2|null|  1|
++---++---+
{code}

I think this could be at least consistent with Hive's implementation:

{code}
hive> SELECT json_tuple('{"a":1, "b":2}', cast(NULL AS STRING), 'b', cast(NULL 
AS STRING), 'a');
...
NULL2   NULL1
{code}


was (Author: hyukjin.kwon):
[~cjm], I was thinking like

{code}
spark.sql("""SELECT json_tuple('{"a":1, "b":2}', cast(NULL AS STRING), 'b', 
cast(NULL AS STRING), 'a');""").show()

++---++---+
|  c0| c1|  c2| c3|
++---++---+
|null|  2|null|  1|
++---++---+
{code}

I think this could be at least consistent with Hive's implementation:

{code}
hive> SELECT json_tuple('{"a":1, "b":2}', cast(NULL AS STRING), 'b', cast(NULL 
AS STRING), 'a');
...
NULL2   NULL1
{code}

> json_tuple throws NullPointException when column is null as string type.
> 
>
> Key: SPARK-21677
> URL: https://issues.apache.org/jira/browse/SPARK-21677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: Starter
>
> I was testing {{json_tuple}} before using this to extract values from JSONs 
> in my testing cluster but I found it could actually throw  
> {{NullPointException}} as below sometimes:
> {code}
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Hyukjin'))").show()
> +---+
> | c0|
> +---+
> |224|
> +---+
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Jackson'))").show()
> ++
> |  c0|
> ++
> |null|
> ++
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show()
> ...
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:367)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:366)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames$lzycompute(jsonExpressions.scala:366)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames(jsonExpressions.scala:365)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields$lzycompute(jsonExpressions.scala:373)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields(jsonExpressions.scala:373)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.org$apache$spark$sql$catalyst$expressions$JsonTuple$$parseRow(jsonExpressions.scala:417)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:401)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:400)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2559)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.eval(jsonExpressions.scala:400)
> {code}
> It sounds we should show explicit error messages or return {{NULL}}.
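
For context, a tiny standalone Scala sketch, an assumption about the shape of a 
fix rather than the actual patch: fold the constant field names so that a 
{{null}} literal becomes {{None}} instead of triggering the NullPointerException 
in the trace above, letting json_tuple emit {{NULL}} for that column.

{code}
// Sketch: treat a null constant field name as "no field" instead of calling
// toString on it, which is where the NPE comes from.
def foldFieldNames(fieldNames: Seq[Any]): IndexedSeq[Option[String]] =
  fieldNames.map(name => Option(name).map(_.toString)).toIndexedSeq

// foldFieldNames(Seq("Hyukjin", null)) == Vector(Some("Hyukjin"), None)
{code}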



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21677) json_tuple throws NullPointException when column is null as string type.

2017-08-10 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121658#comment-16121658
 ] 

Hyukjin Kwon commented on SPARK-21677:
--

[~cjm], I was thinking like

{code}
spark.sql("""SELECT json_tuple('{"a":1, "b":2}', cast(NULL AS STRING), 'b', 
cast(NULL AS STRING), 'a');""").show()

++---++---+
|  c0| c1|  c2| c3|
++---++---+
|null|  2|null|  1|
++---++---+
{code}

I think this could be at least consistent with Hive's implementation:

{code}
hive> SELECT json_tuple('{"a":1, "b":2}', cast(NULL AS STRING), 'b', cast(NULL 
AS STRING), 'a');
...
NULL2   NULL1
{code}

> json_tuple throws NullPointException when column is null as string type.
> 
>
> Key: SPARK-21677
> URL: https://issues.apache.org/jira/browse/SPARK-21677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: Starter
>
> I was testing {{json_tuple}} before using this to extract values from JSONs 
> in my testing cluster but I found it could actually throw  
> {{NullPointException}} as below sometimes:
> {code}
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Hyukjin'))").show()
> +---+
> | c0|
> +---+
> |224|
> +---+
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Jackson'))").show()
> ++
> |  c0|
> ++
> |null|
> ++
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show()
> ...
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:367)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:366)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames$lzycompute(jsonExpressions.scala:366)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames(jsonExpressions.scala:365)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields$lzycompute(jsonExpressions.scala:373)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields(jsonExpressions.scala:373)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.org$apache$spark$sql$catalyst$expressions$JsonTuple$$parseRow(jsonExpressions.scala:417)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:401)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:400)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2559)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.eval(jsonExpressions.scala:400)
> {code}
> It sounds we should show explicit error messages or return {{NULL}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them

2017-08-10 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-21656:
--
Description: 
Right now with dynamic allocation spark starts by getting the number of 
executors it needs to run all the tasks in parallel (or the configured maximum) 
for that stage.  After it gets that number it will never reacquire more unless 
either an executor dies, is explicitly killed by yarn or it goes to the next 
stage.  The dynamic allocation manager has the concept of idle timeout. 
Currently this says if a task hasn't been scheduled on that executor for a 
configurable amount of time (60 seconds by default), then let that executor go. 
 Note when it lets that executor go due to the idle timeout it never goes back 
to see if it should reacquire more.

This is a problem for multiple reasons:
1. Unexpected things can happen in the system that cause delays, and Spark should 
be resilient to these. If the driver is GC'ing, there are network delays, etc., we 
could idle-timeout executors even though there are tasks to run on them; the 
scheduler just hasn't had time to start those tasks.  Note that in the worst case 
this allows the number of executors to go to 0, and we have a deadlock.

2. Internal Spark components have opposing requirements. The scheduler tries to 
get locality; dynamic allocation doesn't know about this, and if it lets the 
executors go it prevents the scheduler from doing what it was designed to do.  For 
example, the scheduler first tries to schedule node-local, and during this time it 
can skip scheduling on some executors.  After a while, though, the scheduler falls 
back from node-local to scheduling on rack-local, and then eventually on any node.  
So while the scheduler is doing node-local scheduling, the other executors can 
idle-timeout.  This means that when the scheduler does fall back to rack or any 
locality where it would have used those executors, we have already let them go, 
and it can't schedule all the tasks it otherwise could, which can have a huge 
negative impact on job run time.
 
In both of these cases, when the executors idle-timeout we never go back to check 
whether we need more executors (until the next stage starts).  In the worst case 
you end up with 0 and a deadlock, but generally this shows itself by just going 
down to very few executors when you could have tens of thousands of tasks to run 
on them, which makes the job take far more time (in my case I've seen jobs that 
should take minutes take hours because only a few executors were left).

We should handle these situations in Spark.  The most straightforward approach 
would be to not allow executors to idle-timeout when there are tasks that could 
run on those executors. This would allow the scheduler to do its job with locality 
scheduling.  It also fixes number 1 above, because you can never go into a 
deadlock: enough executors are kept to run all the tasks.

There are other approaches to fix this, like explicitly preventing it from going 
to 0 executors; that prevents a deadlock but can still slow the job down greatly.  
We could also change it at some point to re-check whether we should get more 
executors, but this adds extra logic (we would have to decide when to check), and 
it's also just overhead to let executors go and then re-acquire them, which would 
slow the job down since the executors aren't immediately there for the scheduler 
to place tasks on.
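
To make the proposed rule concrete, a minimal Scala sketch of the guard; the 
function and parameter names ({{mayRemoveIdleExecutor}}, 
{{pendingAndRunningTasks}}, etc.) are hypothetical and are not the actual 
ExecutorAllocationManager fields.

{code}
// Sketch: only let an idle executor go if the remaining executors still have
// enough task slots for every task that is pending or running.
def mayRemoveIdleExecutor(
    currentExecutors: Int,
    coresPerExecutor: Int,
    cpusPerTask: Int,
    pendingAndRunningTasks: Int): Boolean = {
  val slotsAfterRemoval = (currentExecutors - 1) * (coresPerExecutor / cpusPerTask)
  slotsAfterRemoval >= pendingAndRunningTasks
}
{code}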

  was:
Right now spark lets go of executors when they are idle for the 60s (or 
configurable time). I have seen spark let them go when they are idle but they 
were really needed. I have seen this issue when the scheduler was waiting to 
get node locality but that takes longer then the default idle timeout. In these 
jobs the number of executors goes down really small (less than 10) but there 
are still like 80,000 tasks to run.
We should consider not allowing executors to idle timeout if they are still 
needed according to the number of tasks to be run.

There are multiple reasons:
1 . Things can happen in the system that are not expected that can cause 
delays. Spark should be resilient to these. If the driver is GC'ing, you have 
network delays, etc we could idle timeout executors even though there are tasks 
to run on them its just the scheduler hasn't had time to start those tasks. 
these just slow down the users job, the user does not want this.

2. Internal Spark components have opposing requirements. The scheduler has a 
requirement to try to get locality, the dynamic allocation doesn't know about 
this and it giving away executors it hurting the scheduler from doing what it 
was designed to do.
Ideally we have enough executors to run all the tasks on. If dynamic allocation 
allows those to idle timeout the scheduler can not make proper 

[jira] [Commented] (SPARK-21677) json_tuple throws NullPointException when column is null as string type.

2017-08-10 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121641#comment-16121641
 ] 

Jen-Ming Chung commented on SPARK-21677:


To [~hyukjin.kwon]: for the return of {{NULL}} you mentioned, does it mean all 
fields should be null in json_tuple, or just the non-existent field, as shown in 
the following? Thanks!

{code:language=scala|borderStyle=solid}
e.g., spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 'not_exising_fields')""").show()

+---+---+----+
| c0| c1|  c2|
+---+---+----+
|  1|  2|null|
+---+---+----+
{code}
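
To make the two readings concrete, a small sketch of the cases in question (my own example; behavior as I understand it on the affected versions):

{code:scala}
// Case 1: the JSON string itself is null -- json_tuple already yields a row of nulls.
spark.sql("SELECT json_tuple(CAST(NULL AS STRING), 'a', 'b')").show()

// Case 2: a field-name argument evaluates to null (e.g. trim(null)) -- this is the
// case from the report that currently throws NullPointerException and that would
// return NULL under the proposal.
spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', trim(null))""").show()
{code}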



> json_tuple throws NullPointException when column is null as string type.
> 
>
> Key: SPARK-21677
> URL: https://issues.apache.org/jira/browse/SPARK-21677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: Starter
>
> I was testing {{json_tuple}} before using this to extract values from JSONs 
> in my testing cluster but I found it could actually throw  
> {{NullPointException}} as below sometimes:
> {code}
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Hyukjin'))").show()
> +---+
> | c0|
> +---+
> |224|
> +---+
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Jackson'))").show()
> ++
> |  c0|
> ++
> |null|
> ++
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show()
> ...
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:367)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:366)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames$lzycompute(jsonExpressions.scala:366)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames(jsonExpressions.scala:365)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields$lzycompute(jsonExpressions.scala:373)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields(jsonExpressions.scala:373)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.org$apache$spark$sql$catalyst$expressions$JsonTuple$$parseRow(jsonExpressions.scala:417)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:401)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:400)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2559)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.eval(jsonExpressions.scala:400)
> {code}
> It sounds we should show explicit error messages or return {{NULL}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121624#comment-16121624
 ] 

Vincent commented on SPARK-21688:
-

Okay, yes, true. It can still run without issue; we are just offering 
another choice for those who want a 50% speedup or more by using native 
BLAS in their case. They can also stick to F2J with a simple setting in the 
Spark configuration.

The problem with the default thread settings has been discussed in 
https://issues.apache.org/jira/browse/SPARK-21305. I believe it's non-trivial, 
but it seems to be a common issue for all native BLAS implementations, and 
there is no good solution to it for now.
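
For completeness, a sketch of the settings being referred to (the property and environment variable names come from netlib-java, MKL and OpenBLAS; the values are illustrative):

{code:scala}
// Either pin the native BLAS thread count per executor, or explicitly keep the
// pure-JVM F2J implementation via the netlib-java system property.
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("svm-blas-settings")
  .config("spark.executorEnv.MKL_NUM_THREADS", "1")        // MKL threading
  .config("spark.executorEnv.OPENBLAS_NUM_THREADS", "1")   // OpenBLAS threading
  // or stick with F2J:
  .config("spark.executor.extraJavaOptions",
    "-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS")
  .getOrCreate()
{code}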

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: ddot unitest.png, mllib svm training.png, 
> native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> in current mllib SVM implementation, we found that the CPU is not fully 
> utilized, one reason is that f2j blas is set to be used in the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305) that with proper 
> settings, native blas is generally better than f2j on the uni-test level, 
> here we make the blas operations in SVM go with MKL blas and get an end to 
> end performance report showing that in most cases native blas outperformance 
> f2j blas up to 50%.
> So, we suggest removing those f2j-fixed calling and going for native blas if 
> available. If this proposal is acceptable, we will move on to benchmark other 
> algorithms impacted. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21656) spark dynamic allocation should not idle timeout executors when tasks still to run

2017-08-10 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-21656:
--
Description: 
Right now Spark lets go of executors when they have been idle for 60s (or a 
configurable time). I have seen Spark let them go when they were idle but still 
really needed. I have seen this issue when the scheduler was waiting to 
get node locality, which takes longer than the default idle timeout. In these 
jobs the number of executors goes down really low (less than 10) while there 
are still around 80,000 tasks to run.
We should consider not allowing executors to idle timeout if they are still 
needed according to the number of tasks to be run.

There are multiple reasons:
1. Unexpected things can happen in the system that cause delays, and Spark 
should be resilient to these. If the driver is GC'ing, there are network delays, 
etc., we could idle timeout executors even though there are tasks to run on 
them; the scheduler simply hasn't had time to start those tasks. These delays 
just slow down the user's job, and the user does not want this.

2. Internal Spark components have opposing requirements. The scheduler has a 
requirement to try to get locality; dynamic allocation doesn't know about 
this, and by giving away executors it prevents the scheduler from doing what it 
was designed to do.
Ideally we have enough executors to run all the tasks on. If dynamic allocation 
allows those to idle timeout, the scheduler cannot make proper decisions. In 
the end this hurts users by affecting the job. A user should not have to mess 
with the configs to keep this basic behavior.

  was:
Right now spark lets go of executors when they are idle for the 60s (or 
configurable time). I have seen spark let them go when they are idle but they 
were really needed. I have seen this issue when the scheduler was waiting to 
get node locality but that takes longer then the default idle timeout. In these 
jobs the number of executors goes down really small (less than 10) but there 
are still like 80,000 tasks to run.
We should consider not allowing executors to idle timeout if they are still 
needed according to the number of tasks to be run.


> spark dynamic allocation should not idle timeout executors when tasks still 
> to run
> --
>
> Key: SPARK-21656
> URL: https://issues.apache.org/jira/browse/SPARK-21656
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Jong Yoon Lee
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Right now spark lets go of executors when they are idle for the 60s (or 
> configurable time). I have seen spark let them go when they are idle but they 
> were really needed. I have seen this issue when the scheduler was waiting to 
> get node locality but that takes longer then the default idle timeout. In 
> these jobs the number of executors goes down really small (less than 10) but 
> there are still like 80,000 tasks to run.
> We should consider not allowing executors to idle timeout if they are still 
> needed according to the number of tasks to be run.
> There are multiple reasons:
> 1 . Things can happen in the system that are not expected that can cause 
> delays. Spark should be resilient to these. If the driver is GC'ing, you have 
> network delays, etc we could idle timeout executors even though there are 
> tasks to run on them its just the scheduler hasn't had time to start those 
> tasks. these just slow down the users job, the user does not want this.
> 2. Internal Spark components have opposing requirements. The scheduler has a 
> requirement to try to get locality, the dynamic allocation doesn't know about 
> this and it giving away executors it hurting the scheduler from doing what it 
> was designed to do.
> Ideally we have enough executors to run all the tasks on. If dynamic 
> allocation allows those to idle timeout the scheduler can not make proper 
> decisions. In the end this hurts users by affects the job. A user should not 
> have to mess with the configs to keep this basic behavior.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them

2017-08-10 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-21656:
--
Summary: spark dynamic allocation should not idle timeout executors when 
there are enough tasks to run on them  (was: spark dynamic allocation should 
not idle timeout executors when tasks still to run)

> spark dynamic allocation should not idle timeout executors when there are 
> enough tasks to run on them
> -
>
> Key: SPARK-21656
> URL: https://issues.apache.org/jira/browse/SPARK-21656
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Jong Yoon Lee
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Right now spark lets go of executors when they are idle for the 60s (or 
> configurable time). I have seen spark let them go when they are idle but they 
> were really needed. I have seen this issue when the scheduler was waiting to 
> get node locality but that takes longer then the default idle timeout. In 
> these jobs the number of executors goes down really small (less than 10) but 
> there are still like 80,000 tasks to run.
> We should consider not allowing executors to idle timeout if they are still 
> needed according to the number of tasks to be run.
> There are multiple reasons:
> 1 . Things can happen in the system that are not expected that can cause 
> delays. Spark should be resilient to these. If the driver is GC'ing, you have 
> network delays, etc we could idle timeout executors even though there are 
> tasks to run on them its just the scheduler hasn't had time to start those 
> tasks. these just slow down the users job, the user does not want this.
> 2. Internal Spark components have opposing requirements. The scheduler has a 
> requirement to try to get locality, the dynamic allocation doesn't know about 
> this and it giving away executors it hurting the scheduler from doing what it 
> was designed to do.
> Ideally we have enough executors to run all the tasks on. If dynamic 
> allocation allows those to idle timeout the scheduler can not make proper 
> decisions. In the end this hurts users by affects the job. A user should not 
> have to mess with the configs to keep this basic behavior.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121599#comment-16121599
 ] 

Sean Owen commented on SPARK-21688:
---

I mean best case in that MKL might be a little different from OpenBLAS, but this 
is minor. 

I suppose this won't impact users with no acceleration. 
What is the slowdown for those who use default thread settings? That's the most 
common scenario; if it's non-trivial we can't just ignore it. 

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: ddot unitest.png, mllib svm training.png, 
> native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> in current mllib SVM implementation, we found that the CPU is not fully 
> utilized, one reason is that f2j blas is set to be used in the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305) that with proper 
> settings, native blas is generally better than f2j on the uni-test level, 
> here we make the blas operations in SVM go with MKL blas and get an end to 
> end performance report showing that in most cases native blas outperformance 
> f2j blas up to 50%.
> So, we suggest removing those f2j-fixed calling and going for native blas if 
> available. If this proposal is acceptable, we will move on to benchmark other 
> algorithms impacted. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121589#comment-16121589
 ] 

Vincent commented on SPARK-21688:
-

[~srowen] Thanks for your comments. I think that if a user decides to use native 
BLAS, they should be aware of the threading-configuration impacts; checking this 
env variable in MLlib doesn't make sense. And no, we didn't just present 
the best-case result; we took the average value of three runs for 
each case. The results show that for small datasets native BLAS might not have 
an advantage over f2j, but the gap is small, and we expect big data 
processing to be the more common case here.

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: ddot unitest.png, mllib svm training.png, 
> native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> in current mllib SVM implementation, we found that the CPU is not fully 
> utilized, one reason is that f2j blas is set to be used in the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305) that with proper 
> settings, native blas is generally better than f2j on the uni-test level, 
> here we make the blas operations in SVM go with MKL blas and get an end to 
> end performance report showing that in most cases native blas outperformance 
> f2j blas up to 50%.
> So, we suggest removing those f2j-fixed calling and going for native blas if 
> available. If this proposal is acceptable, we will move on to benchmark other 
> algorithms impacted. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21691) Accessing canonicalized plan for query with limit throws exception

2017-08-10 Thread Bjoern Toldbod (JIRA)
Bjoern Toldbod created SPARK-21691:
--

 Summary: Accessing canonicalized plan for query with limit throws 
exception
 Key: SPARK-21691
 URL: https://issues.apache.org/jira/browse/SPARK-21691
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Bjoern Toldbod


Accessing the logical, canonicalized plan fails for queries with limits.
The following demonstrates the issue:

{code:java}
val session = SparkSession.builder.master("local").getOrCreate()

// This works
session.sql("select * from (values 0, 1)").queryExecution.logical.canonicalized

// This fails
session.sql("select * from (values 0, 1) limit 
1").queryExecution.logical.canonicalized
{code}

The message in the thrown exception is somewhat confusing (or at least not 
directly related to the limit):
"Invalid call to toAttribute on unresolved object, tree: *"



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList

2017-08-10 Thread Paul Praet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121401#comment-16121401
 ] 

Paul Praet commented on SPARK-21402:


It seems changing the order of the fields in the struct can give some 
improvement, but when I add more fields the problem just gets worse - some 
fields never get filled in, or get filled in twice.

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setter and getters which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want and the result is correct 
> all through the way before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +-+--+
> |1498854000|1498870800  |
> This is the result after collecting the reuslts for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects and filters in optimizer

2017-08-10 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-21520:
--
Description: 
Currently the optimizer does a lot of special handling for non-deterministic 
projects and filters, but it is not good enough. This patch adds a new special 
case for non-deterministic projects and filters, so that we only read the 
fields the user actually needs.
For example, when the project contains a non-deterministic function (rand), the 
optimizer generates an executed plan like:

*HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], 
output=[k#403L, sum#800L])
+- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS 
k#403L]
   +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table

HiveTableScan will read all the fields from the table, but we only need 'd004'. 
This hurts the performance of the task.
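
A minimal sketch of the query shape described above (the table and column names are the ones from the plan; the exact expressions are my reconstruction, and a SparkSession named spark is assumed):

{code:scala}
import org.apache.spark.sql.functions._

// Only d004 is actually needed from the Hive table, but because the project is
// non-deterministic (rand), the scan currently keeps every column.
val df = spark.table("XXX_database.XXX_table")
  .select(col("d004").as("id"), floor(rand() * 1000).as("k"))
  .groupBy("k")
  .agg(sum(col("id").cast("bigint")))

df.explain(true)   // the HiveTableScan lists all ~190 columns instead of just d004
{code}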


  was:
Currently, Did a lot of special handling for non-deterministic projects and 
filters in optimizer. but not good enough. this patch add a new special case 
for non-deterministic projects and filters. Deal with that we only need to read 
user needs fields for non-deterministic projects and filters in optimizer.



> Improvement a special case for non-deterministic projects and filters in 
> optimizer
> --
>
> Key: SPARK-21520
> URL: https://issues.apache.org/jira/browse/SPARK-21520
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> Currently, Did a lot of special handling for non-deterministic projects and 
> filters in optimizer. but not good enough. this patch add a new special case 
> for non-deterministic projects and filters. Deal with that we only need to 
> read user needs fields for non-deterministic projects and filters in 
> optimizer.
> For example, the fields of project contains nondeterministic function(rand 
> function), after a executedPlan optimizer generated:
> *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
> bigint))], output=[k#403L, sum#800L])
> +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) 
> AS k#403L]
>+- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
> d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
> d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, 
> c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], 
> MetastoreRelation XXX_database, XXX_table
> HiveTableScan will read all the fields from table. but we only need to ‘d004’ 
> . it will affect the performance of task.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-20971) Purge the metadata log for FileStreamSource

2017-08-10 Thread Fei Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Shao updated SPARK-20971:
-
Comment: was deleted

(was: Hi Zhu,
What does metadata logs stands for please?)

> Purge the metadata log for FileStreamSource
> ---
>
> Key: SPARK-20971
> URL: https://issues.apache.org/jira/browse/SPARK-20971
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Shixiong Zhu
>
> Currently 
> [FileStreamSource.commit|https://github.com/apache/spark/blob/16186cdcbce1a2ec8f839c550e6b571bf5dc2692/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L258]
>  is empty. We can delete unused metadata logs in this method to reduce the 
> size of log files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21402) Java encoders - switch fields on collectAsList

2017-08-10 Thread Paul Praet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121297#comment-16121297
 ] 

Paul Praet edited comment on SPARK-21402 at 8/10/17 9:04 AM:
-

I can confirm this problem persists in Spark 2.2.0: fields get all swapped when 
you use the bean encoder on a dataset with an array of structs. A plain struct 
works, an array of structs does not. Pretty big issue if you ask me.

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- id: string (nullable = false)
 |-- type: string (nullable = false)
 |-- ssid: string (nullable = false)
++--+++
|writeKey|id|type|ssid|
++--+++
|someWriteKey|someId|someType|someSSID|
++--+++
{noformat}

When I convert into a struct, everything is still fine:

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: struct (nullable = false)
 ||-- id: string (nullable = false)
 ||-- type: string (nullable = false)
 ||-- ssid: string (nullable = false)

++--+
|writeKey|nodes |
++--+
|someWriteKey|[someId,someType,someSSID]|
++--+
{noformat}

When I do a groupBy on writeKey and a collect_set() on the nodes, we get:

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: string (nullable = false)
 |||-- type: string (nullable = false)
 |||-- ssid: string (nullable = false)

+++
|writeKey|nodes   |
+++
|someWriteKey|[[someId,someType,someSSID]]|
+++
{noformat}

When I convert  this to Java...

{code:java}
Dataset<Row> dfArray = dfStruct.groupBy("writeKey")
.agg(functions.collect_set("nodes").alias("nodes"));
Encoder<Topology> topologyEncoder = Encoders.bean(Topology.class);
Dataset<Topology> datasetMultiple = dfArray.as(topologyEncoder);
System.out.println(datasetMultiple.first());
{code}
This prints:

{noformat}
Topology{writeKey='someWriteKey', nodes=[Node{id='someId', type='someSSID', 
ssid='someType'}]}
{noformat}
You can clearly see the type and ssid fields were swapped.

POJO classes:
{code:java}
 public static class Topology {
private String writeKey;
private List<Node> nodes;

public Topology() {
}

public String getWriteKey() {
return writeKey;
}

public void setWriteKey(String writeKey) {
this.writeKey = writeKey;
}

public List<Node> getNodes() {
return nodes;
}

public void setNodes(List<Node> nodes) {
this.nodes = nodes;
}

@Override
public String toString() {
return "Topology{" +
"writeKey='" + writeKey + '\'' +
", nodes=" + nodes +
'}';
}
}

public static class Node {
private String id;
private String type;
private String ssid;

public Node() {
}

public String getId() {
return id;
}

public void setId(String id) {
this.id = id;
}

public String getType() {
return type;
}

public void setType(String type) {
this.type = type;
}

public String getSsid() {
return ssid;
}

public void setSsid(String ssid) {
this.ssid = ssid;
}


@Override
public String toString() {
return "Node{" +
"id='" + id + '\'' +
", type='" + type + '\'' +
", ssid='" + ssid + '\'' +
'}';
}
}
{code}
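
For what it's worth, a Scala workaround sketch based on my reading of the symptom (the swapped pair is exactly what positional binding against alphabetically ordered bean properties would produce: (id, ssid, type) versus the struct order (id, type, ssid)). It reuses the {{dfStruct}} name from the Java snippet above:

{code:scala}
import org.apache.spark.sql.functions._

// Hypothetical workaround: rebuild the struct with its fields in alphabetical order
// before collect_set, so positional binding lines up with the bean properties.
val reordered = dfStruct.withColumn("nodes",
  struct(col("nodes.id").as("id"), col("nodes.ssid").as("ssid"), col("nodes.type").as("type")))

val dfArrayReordered = reordered.groupBy("writeKey")
  .agg(collect_set("nodes").alias("nodes"))
{code}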




was (Author: praetp):
I can confirm this problem persists in Spark 2.2.0: fields get all swapped when 
you use the bean encoder on a dataset with an array of structs. A plain struct 
works, an array of structs does not. Pretty big issue if you ask me.

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- id: string (nullable = false)
 |-- type: string (nullable = false)
 |-- ssid: string (nullable = false)
++--+++
|writeKey|id|type|ssid|
++--+++
|someWriteKey|someId|someType|someSSID|
++--+++
{noformat}

When I convert into a struct, everything is still fine:

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: struct (nullable = false)
 ||-- id: string (nullable = false)
 ||-- type: string (nullable = false)
 ||-- ssid: string (nullable = false)


[jira] [Comment Edited] (SPARK-21402) Java encoders - switch fields on collectAsList

2017-08-10 Thread Paul Praet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121297#comment-16121297
 ] 

Paul Praet edited comment on SPARK-21402 at 8/10/17 9:03 AM:
-

I can confirm this problem persists in Spark 2.2.0: fields get all swapped when 
you use the bean encoder on a dataset with an array of structs. A plain struct 
works, an array of structs does not. Pretty big issue if you ask me.

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- id: string (nullable = false)
 |-- type: string (nullable = false)
 |-- ssid: string (nullable = false)
++--+++
|writeKey|id|type|ssid|
++--+++
|someWriteKey|someId|someType|someSSID|
++--+++
{noformat}

When I convert into a struct, everything is still fine:

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: struct (nullable = false)
 ||-- id: string (nullable = false)
 ||-- type: string (nullable = false)
 ||-- ssid: string (nullable = false)

++--+
|writeKey|nodes |
++--+
|someWriteKey|[someId,someType,someSSID]|
++--+
{noformat}

When I do a groupBy on writeKey and a collect_set() on the nodes, we get:

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: string (nullable = false)
 |||-- type: string (nullable = false)
 |||-- ssid: string (nullable = false)

+++
|writeKey|nodes   |
+++
|someWriteKey|[[someId,someType,someSSID]]|
+++
{noformat}

When I convert  this to Java...

{code:java}
Dataset<Row> dfArray = dfStruct.groupBy("writeKey")
.agg(functions.collect_set("nodes").alias("nodes"));
Encoder<Topology> topologyEncoder = Encoders.bean(Topology.class);
Dataset<Topology> datasetMultiple = dfArray.as(topologyEncoder);
System.out.println(datasetMultiple.first());
{code}
This prints:

{noformat}
Topology{writeKey='someWriteKey', nodes=[Node{id='someId', type='someSSID', 
ssid='someType'}]}
{noformat}
You can clearly see the type and ssid fields were swapped.

POJO classes:
{code:java}
 public static class Topology {
private String writeKey;
private List<Node> nodes;

public Topology() {
}

public String getWriteKey() {
return writeKey;
}

public void setWriteKey(String writeKey) {
this.writeKey = writeKey;
}

public List<Node> getNodes() {
return nodes;
}

public void setNodes(List<Node> nodes) {
this.nodes = nodes;
}

@Override
public String toString() {
return "Topology{" +
"writeKey='" + writeKey + '\'' +
", nodes=" + nodes +
'}';
}
}

public static class Node {
private String id;
private String type;
private String ssid;

public Node() {
}

public String getId() {
return id;
}

public void setId(String id) {
this.id = id;
}

public String getType() {
return type;
}

public void setType(String type) {
this.type = type;
}

public String getSsid() {
return ssid;
}

public void setSsid(String ssid) {
this.ssid = ssid;
}


@Override
public String toString() {
return "Node{" +
"id='" + id + '\'' +
", type='" + type + '\'' +
", ssid='" + ssid + '\'' +
'}';
}
}
{code}




was (Author: praetp):
I can confirm this problem persists in Spark 2.2.0: fields get all swapped when 
you use the bean encoder on a dataset with an array of structs. A plain struct 
works, an array of structs does not. Pretty big issue if you ask me.
I have a datamodel like this (all Strings)

{noformat}
++--+++
|writeKey|id|type|ssid|
++--+++
|someWriteKey|someId|someType|someSSID|
++--+++
{noformat}

When I convert into a struct, everything is still fine:

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: struct (nullable = false)
 ||-- id: string (nullable = false)
 ||-- type: string (nullable = false)
 ||-- ssid: string (nullable = false)

++--+
|writeKey|nodes |

[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121299#comment-16121299
 ] 

Sean Owen commented on SPARK-21688:
---

Understood, though it potentially impacts the benchmarks. You have a best-case 
result here.

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: ddot unitest.png, mllib svm training.png, 
> native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> in current mllib SVM implementation, we found that the CPU is not fully 
> utilized, one reason is that f2j blas is set to be used in the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305) that with proper 
> settings, native blas is generally better than f2j on the uni-test level, 
> here we make the blas operations in SVM go with MKL blas and get an end to 
> end performance report showing that in most cases native blas outperformance 
> f2j blas up to 50%.
> So, we suggest removing those f2j-fixed calling and going for native blas if 
> available. If this proposal is acceptable, we will move on to benchmark other 
> algorithms impacted. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList

2017-08-10 Thread Paul Praet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121297#comment-16121297
 ] 

Paul Praet commented on SPARK-21402:


I can confirm this problem persists in Spark 2.2.0: fields get all swapped when 
you use the bean encoder on a dataset with an array of structs. A plain struct 
works, an array of structs does not. Pretty big issue if you ask me.
I have a datamodel like this (all Strings)

{noformat}
++--+++
|writeKey|id|type|ssid|
++--+++
|someWriteKey|someId|someType|someSSID|
++--+++
{noformat}

When I convert into a struct, everything is still fine:

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: struct (nullable = false)
 ||-- id: string (nullable = false)
 ||-- type: string (nullable = false)
 ||-- ssid: string (nullable = false)

++--+
|writeKey|nodes |
++--+
|someWriteKey|[someId,someType,someSSID]|
++--+
{noformat}

When I do a groupBy on writeKey and a collect_set() on the nodes, we get:

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: string (nullable = false)
 |||-- type: string (nullable = false)
 |||-- ssid: string (nullable = false)

+++
|writeKey|nodes   |
+++
|someWriteKey|[[someId,someType,someSSID]]|
+++
{noformat}

When I convert  this to Java...

{code:java}
Dataset<Row> dfArray = dfStruct.groupBy("writeKey")
.agg(functions.collect_set("nodes").alias("nodes"));
Encoder<Topology> topologyEncoder = Encoders.bean(Topology.class);
Dataset<Topology> datasetMultiple = dfArray.as(topologyEncoder);
System.out.println(datasetMultiple.first());
{code}
This prints:

{noformat}
Topology{writeKey='someWriteKey', nodes=[Node{id='someId', type='someSSID', 
ssid='someType'}]}
{noformat}
You can clearly see the type and ssid fields were swapped.

POJO classes:
{code:java}
 public static class Topology {
private String writeKey;
private List<Node> nodes;

public Topology() {
}

public String getWriteKey() {
return writeKey;
}

public void setWriteKey(String writeKey) {
this.writeKey = writeKey;
}

public List<Node> getNodes() {
return nodes;
}

public void setNodes(List<Node> nodes) {
this.nodes = nodes;
}

@Override
public String toString() {
return "Topology{" +
"writeKey='" + writeKey + '\'' +
", nodes=" + nodes +
'}';
}
}

public static class Node {
private String id;
private String type;
private String ssid;

public Node() {
}

public String getId() {
return id;
}

public void setId(String id) {
this.id = id;
}

public String getType() {
return type;
}

public void setType(String type) {
this.type = type;
}

public String getSsid() {
return ssid;
}

public void setSsid(String ssid) {
this.ssid = ssid;
}


@Override
public String toString() {
return "Node{" +
"id='" + id + '\'' +
", type='" + type + '\'' +
", ssid='" + ssid + '\'' +
'}';
}
}
{code}



> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setter and getters which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> 

[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121284#comment-16121284
 ] 

Peng Meng commented on SPARK-21688:
---

MKL is just an example of native BLAS, if user has Openblas, ATLAS, an so on. 
It also works.

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: ddot unitest.png, mllib svm training.png, 
> native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> in current mllib SVM implementation, we found that the CPU is not fully 
> utilized, one reason is that f2j blas is set to be used in the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305) that with proper 
> settings, native blas is generally better than f2j on the uni-test level, 
> here we make the blas operations in SVM go with MKL blas and get an end to 
> end performance report showing that in most cases native blas outperformance 
> f2j blas up to 50%.
> So, we suggest removing those f2j-fixed calling and going for native blas if 
> available. If this proposal is acceptable, we will move on to benchmark other 
> algorithms impacted. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121270#comment-16121270
 ] 

Sean Owen commented on SPARK-21688:
---

I guess my concern is that this slows things down unless people actually make 
the threading configuration changes described in the docs. I wonder if it's 
possible to check whether the threading env variable is set correctly and 
choose native BLAS only if so, for these ops?
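
A rough sketch of such a check (the helper name is illustrative, not an existing Spark API):

{code:scala}
// Only prefer native BLAS for these ops when the threading environment has been
// pinned to a single thread; otherwise stay on F2J.
def nativeBlasLooksSafe(): Boolean =
  Seq("MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")
    .flatMap(sys.env.get)
    .contains("1")
{code}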

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: ddot unitest.png, mllib svm training.png, 
> native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> in current mllib SVM implementation, we found that the CPU is not fully 
> utilized, one reason is that f2j blas is set to be used in the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305) that with proper 
> settings, native blas is generally better than f2j on the uni-test level, 
> here we make the blas operations in SVM go with MKL blas and get an end to 
> end performance report showing that in most cases native blas outperformance 
> f2j blas up to 50%.
> So, we suggest removing those f2j-fixed calling and going for native blas if 
> available. If this proposal is acceptable, we will move on to benchmark other 
> algorithms impacted. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121254#comment-16121254
 ] 

Vincent commented on SPARK-21688:
-

And if native BLAS is left with the default multi-threading setting, it could 
impact other ops on the JVM, as shown in native-trywait.png in the attached files.

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: ddot unitest.png, mllib svm training.png, 
> native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> in current mllib SVM implementation, we found that the CPU is not fully 
> utilized, one reason is that f2j blas is set to be used in the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305) that with proper 
> settings, native blas is generally better than f2j on the uni-test level, 
> here we make the blas operations in SVM go with MKL blas and get an end to 
> end performance report showing that in most cases native blas outperformance 
> f2j blas up to 50%.
> So, we suggest removing those f2j-fixed calling and going for native blas if 
> available. If this proposal is acceptable, we will move on to benchmark other 
> algorithms impacted. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-21688:

Attachment: native-trywait.png

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: ddot unitest.png, mllib svm training.png, 
> native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> in current mllib SVM implementation, we found that the CPU is not fully 
> utilized, one reason is that f2j blas is set to be used in the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305) that with proper 
> settings, native blas is generally better than f2j on the uni-test level, 
> here we make the blas operations in SVM go with MKL blas and get an end to 
> end performance report showing that in most cases native blas outperformance 
> f2j blas up to 50%.
> So, we suggest removing those f2j-fixed calling and going for native blas if 
> available. If this proposal is acceptable, we will move on to benchmark other 
> algorithms impacted. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21684) df.write double escaping all the already escaped characters except the first one

2017-08-10 Thread Taran Saini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taran Saini updated SPARK-21684:

Attachment: SparkQuotesTest2.scala

PFA the same.

> df.write double escaping all the already escaped characters except the first 
> one
> 
>
> Key: SPARK-21684
> URL: https://issues.apache.org/jira/browse/SPARK-21684
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Taran Saini
> Attachments: SparkQuotesTest2.scala
>
>
> Hi,
> If we have a dataframe with the column value as {noformat} ab\,cd\,ef\,gh 
> {noformat}
> Then while writing it is being written as 
> {noformat} "ab\,cd\\,ef\\,gh" {noformat}
> i.e it double escapes all the already escaped commas/delimiters but not the 
> first one.
> This is weird behaviour considering either it should do for all or none.
> If I do mention df.option("escape","") as empty then it solves this problem 
> but the double quotes inside the same value if any are preceded by a special 
> char i.e '\u00'. Why does it do so when the escape character is set as 
> ""(empty)?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21684) df.write double escaping all the already escaped characters except the first one

2017-08-10 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121244#comment-16121244
 ] 

Liang-Chi Hsieh commented on SPARK-21684:
-

Would you mind providing a small piece of code to reproduce it? Thanks.
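
For context, a minimal reconstruction of the behaviour described in the issue (my sketch, not the reporter's attached SparkQuotesTest2.scala; the output path is hypothetical and a SparkSession named {{spark}} is assumed):

{code:scala}
import spark.implicits._

// A single value whose delimiters are already escaped with a backslash.
val df = Seq("""ab\,cd\,ef\,gh""").toDF("value")

// Reported behaviour: the written value becomes "ab\,cd\\,ef\\,gh" -- every escaped
// comma except the first is escaped a second time.
df.write.csv("/tmp/spark-21684-repro")
{code}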

> df.write double escaping all the already escaped characters except the first 
> one
> 
>
> Key: SPARK-21684
> URL: https://issues.apache.org/jira/browse/SPARK-21684
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Taran Saini
>
> Hi,
> If we have a dataframe with the column value as {noformat} ab\,cd\,ef\,gh 
> {noformat}
> Then while writing it is being written as 
> {noformat} "ab\,cd\\,ef\\,gh" {noformat}
> i.e it double escapes all the already escaped commas/delimiters but not the 
> first one.
> This is weird behaviour considering either it should do for all or none.
> If I do mention df.option("escape","") as empty then it solves this problem 
> but the double quotes inside the same value if any are preceded by a special 
> char i.e '\u00'. Why does it do so when the escape character is set as 
> ""(empty)?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-21688:

Attachment: (was: uni-test on ddot.png)

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: ddot unitest.png, mllib svm training.png, svm1.png, 
> svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> in current mllib SVM implementation, we found that the CPU is not fully 
> utilized, one reason is that f2j blas is set to be used in the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305) that with proper 
> settings, native blas is generally better than f2j on the uni-test level, 
> here we make the blas operations in SVM go with MKL blas and get an end to 
> end performance report showing that in most cases native blas outperformance 
> f2j blas up to 50%.
> So, we suggest removing those f2j-fixed calling and going for native blas if 
> available. If this proposal is acceptable, we will move on to benchmark other 
> algorithms impacted. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-21688:

Attachment: ddot unitest.png

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: ddot unitest.png, mllib svm training.png, svm1.png, 
> svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> in current mllib SVM implementation, we found that the CPU is not fully 
> utilized, one reason is that f2j blas is set to be used in the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305) that with proper 
> settings, native blas is generally better than f2j on the uni-test level, 
> here we make the blas operations in SVM go with MKL blas and get an end to 
> end performance report showing that in most cases native blas outperformance 
> f2j blas up to 50%.
> So, we suggest removing those f2j-fixed calling and going for native blas if 
> available. If this proposal is acceptable, we will move on to benchmark other 
> algorithms impacted. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121236#comment-16121236
 ] 

Vincent commented on SPARK-21688:
-

Uploaded data we collected before: a unit test on ddot. We can see that for data 
sizes greater than 100, native BLAS normally has the advantage, but if the size 
is smaller than 100, f2j would be the better choice.
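
A sketch of what such a size-based fallback could look like (the threshold of 100 comes from the numbers above; the BLAS classes are from netlib-java, the wrapper itself is hypothetical):

{code:scala}
import com.github.fommil.netlib.{BLAS => NetlibBLAS, F2jBLAS}

// Hypothetical wrapper: use F2J below the size where native BLAS showed no
// advantage, and the loaded native implementation above it.
object SizeAwareBlas {
  private val f2j = new F2jBLAS
  private lazy val impl: NetlibBLAS = NetlibBLAS.getInstance()

  def ddot(n: Int, x: Array[Double], y: Array[Double]): Double =
    if (n < 100) f2j.ddot(n, x, 1, y, 1) else impl.ddot(n, x, 1, y, 1)
}
{code}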

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: mllib svm training.png, svm1.png, svm2.png, 
> svm-mkl-1.png, svm-mkl-2.png, uni-test on ddot.png
>
>
> in current mllib SVM implementation, we found that the CPU is not fully 
> utilized, one reason is that f2j blas is set to be used in the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305) that with proper 
> settings, native blas is generally better than f2j on the uni-test level, 
> here we make the blas operations in SVM go with MKL blas and get an end to 
> end performance report showing that in most cases native blas outperformance 
> f2j blas up to 50%.
> So, we suggest removing those f2j-fixed calling and going for native blas if 
> available. If this proposal is acceptable, we will move on to benchmark other 
> algorithms impacted. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-21688:

Attachment: uni-test on ddot.png

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: mllib svm training.png, svm1.png, svm2.png, 
> svm-mkl-1.png, svm-mkl-2.png, uni-test on ddot.png
>
>
> In the current MLlib SVM implementation, we found that the CPU is not fully 
> utilized; one reason is that f2j BLAS is hard-wired into the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings 
> native BLAS is generally better than f2j at the unit-test level. Here we make 
> the BLAS operations in SVM go through MKL BLAS and present an end-to-end 
> performance report showing that in most cases native BLAS outperforms f2j 
> BLAS by up to 50%.
> So, we suggest removing the f2j-fixed calls and going with native BLAS if 
> available. If this proposal is acceptable, we will move on to benchmark the 
> other affected algorithms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121225#comment-16121225
 ] 

Sean Owen commented on SPARK-21688:
---

I see, so you're saying use native BLAS for level-1 ops. Do we know, however, that 
user environments will have the right threading config such that this is a 
performance win? How big is it? Your benchmark shows a gain only on huge inputs, and 
keep in mind most people won't have MKL. What about inputs of size 10 or 100?

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: mllib svm training.png, svm1.png, svm2.png, 
> svm-mkl-1.png, svm-mkl-2.png
>
>
> In the current MLlib SVM implementation, we found that the CPU is not fully 
> utilized; one reason is that f2j BLAS is hard-wired into the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings 
> native BLAS is generally better than f2j at the unit-test level. Here we make 
> the BLAS operations in SVM go through MKL BLAS and present an end-to-end 
> performance report showing that in most cases native BLAS outperforms f2j 
> BLAS by up to 50%.
> So, we suggest removing the f2j-fixed calls and going with native BLAS if 
> available. If this proposal is acceptable, we will move on to benchmark the 
> other affected algorithms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21679) KMeans Clustering is Not Deterministic

2017-08-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21679:
--
Priority: Minor  (was: Major)

As a general statement, it's hard to get deterministic behavior out of a 
distributed implementation. The order that tasks execute can sometimes matter, 
and it's not possible to control every RNG used by every library.

It might be possible in this particular case.

> KMeans Clustering is Not Deterministic
> --
>
> Key: SPARK-21679
> URL: https://issues.apache.org/jira/browse/SPARK-21679
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Christoph Brücke
>Priority: Minor
>
> I’m trying to figure out how to use KMeans in order to achieve reproducible 
> results. I have found that running the same KMeans instance on the same data, 
> but with different partitioning, produces different clusterings: a simple 
> KMeans run with a fixed seed returns different results on the same training 
> data if the training data is partitioned differently.
> Consider the following example. The same KMeans clustering setup is run on 
> identical data; the only difference is the partitioning of the training data 
> (one partition vs. four partitions).
> {noformat}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions.rand
> import org.apache.spark.ml.clustering.KMeans
> import org.apache.spark.ml.feature.VectorAssembler
> // generate random data for clustering
> val randomData = spark.range(1, 1000).withColumn("a", 
> rand(123)).withColumn("b", rand(321))
> val vecAssembler = new VectorAssembler().setInputCols(Array("a", 
> "b")).setOutputCol("features")
> val data = vecAssembler.transform(randomData)
> // instantiate KMeans with a fixed seed
> val kmeans = new KMeans().setK(10).setSeed(9876L)
> // train the model with different partitionings of the same data
> val dataWith1Partition = data.repartition(1)
> println("1 Partition: " + 
> kmeans.fit(dataWith1Partition).computeCost(dataWith1Partition))
> val dataWith4Partition = data.repartition(4)
> println("4 Partition: " + 
> kmeans.fit(dataWith4Partition).computeCost(dataWith4Partition))
> {noformat}
> I get the following costs:
> {noformat}
> 1 Partition: 16.028212597888057
> 4 Partition: 16.14758460544976
> {noformat}
> What I want to achieve is that repeated computations of the KMeans clustering 
> yield identical results on identical training data, regardless of the 
> partitioning.
> Looking through the Spark source code, I guess the cause is the initialization 
> method of KMeans, which in turn uses the `takeSample` method, which does not 
> seem to be partition agnostic.
> Is this behaviour expected? Is there anything I could do to achieve 
> reproducible results?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21679) KMeans Clustering is Not Deterministic

2017-08-10 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-21679:

Issue Type: Improvement  (was: Bug)

> KMeans Clustering is Not Deterministic
> --
>
> Key: SPARK-21679
> URL: https://issues.apache.org/jira/browse/SPARK-21679
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Christoph Brücke
>
> I’m trying to figure out how to use KMeans in order to achieve reproducible 
> results. I have found that running the same KMeans instance on the same data, 
> but with different partitioning, produces different clusterings: a simple 
> KMeans run with a fixed seed returns different results on the same training 
> data if the training data is partitioned differently.
> Consider the following example. The same KMeans clustering setup is run on 
> identical data; the only difference is the partitioning of the training data 
> (one partition vs. four partitions).
> {noformat}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions.rand
> import org.apache.spark.ml.clustering.KMeans
> import org.apache.spark.ml.feature.VectorAssembler
> // generate random data for clustering
> val randomData = spark.range(1, 1000).withColumn("a", 
> rand(123)).withColumn("b", rand(321))
> val vecAssembler = new VectorAssembler().setInputCols(Array("a", 
> "b")).setOutputCol("features")
> val data = vecAssembler.transform(randomData)
> // instantiate KMeans with a fixed seed
> val kmeans = new KMeans().setK(10).setSeed(9876L)
> // train the model with different partitionings of the same data
> val dataWith1Partition = data.repartition(1)
> println("1 Partition: " + 
> kmeans.fit(dataWith1Partition).computeCost(dataWith1Partition))
> val dataWith4Partition = data.repartition(4)
> println("4 Partition: " + 
> kmeans.fit(dataWith4Partition).computeCost(dataWith4Partition))
> {noformat}
> I get the following costs:
> {noformat}
> 1 Partition: 16.028212597888057
> 4 Partition: 16.14758460544976
> {noformat}
> What I want to achieve is that repeated computations of the KMeans clustering 
> yield identical results on identical training data, regardless of the 
> partitioning.
> Looking through the Spark source code, I guess the cause is the initialization 
> method of KMeans, which in turn uses the `takeSample` method, which does not 
> seem to be partition agnostic.
> Is this behaviour expected? Is there anything I could do to achieve 
> reproducible results?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21679) KMeans Clustering is Not Deterministic

2017-08-10 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121213#comment-16121213
 ] 

Liang-Chi Hsieh commented on SPARK-21679:
-

The old MLlib {{org.apache.spark.mllib.clustering.KMeans}} provides a way to set the 
initial starting points via the {{setInitialModel}} method. I think that can give you 
the deterministic clustering results you need. It looks like {{ml.clustering.KMeans}} 
doesn't provide a similar feature yet.
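
For illustration, a minimal sketch of that workaround with the old API. The starting 
centers and the toy RDD below are placeholders, and {{sc}} is assumed to be the 
spark-shell SparkContext; this is not meant as the exact code for the data in the 
description.

{code:java}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Toy training data; in practice this would be built from the same features as above.
val trainingRdd = sc.parallelize(Seq.fill(1000)(
  Vectors.dense(scala.util.Random.nextDouble(), scala.util.Random.nextDouble())))

// Arbitrary fixed starting centers; with a fixed initial model the result no longer
// depends on how the sampling-based initialization sees the partitions.
val initialCenters = Array(
  Vectors.dense(0.1, 0.1),
  Vectors.dense(0.5, 0.5),
  Vectors.dense(0.9, 0.9))

val kmeans = new KMeans()
  .setK(initialCenters.length)
  .setMaxIterations(20)
  .setInitialModel(new KMeansModel(initialCenters))

val model = kmeans.run(trainingRdd)
println(model.computeCost(trainingRdd))
{code}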

> KMeans Clustering is Not Deterministic
> --
>
> Key: SPARK-21679
> URL: https://issues.apache.org/jira/browse/SPARK-21679
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Christoph Brücke
>
> I’m trying to figure out how to use KMeans in order to achieve reproducible 
> results. I have found that running the same KMeans instance on the same data, 
> but with different partitioning, produces different clusterings: a simple 
> KMeans run with a fixed seed returns different results on the same training 
> data if the training data is partitioned differently.
> Consider the following example. The same KMeans clustering setup is run on 
> identical data; the only difference is the partitioning of the training data 
> (one partition vs. four partitions).
> {noformat}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions.rand
> import org.apache.spark.ml.clustering.KMeans
> import org.apache.spark.ml.feature.VectorAssembler
> // generate random data for clustering
> val randomData = spark.range(1, 1000).withColumn("a", 
> rand(123)).withColumn("b", rand(321))
> val vecAssembler = new VectorAssembler().setInputCols(Array("a", 
> "b")).setOutputCol("features")
> val data = vecAssembler.transform(randomData)
> // instantiate KMeans with a fixed seed
> val kmeans = new KMeans().setK(10).setSeed(9876L)
> // train the model with different partitionings of the same data
> val dataWith1Partition = data.repartition(1)
> println("1 Partition: " + 
> kmeans.fit(dataWith1Partition).computeCost(dataWith1Partition))
> val dataWith4Partition = data.repartition(4)
> println("4 Partition: " + 
> kmeans.fit(dataWith4Partition).computeCost(dataWith4Partition))
> {noformat}
> I get the following costs:
> {noformat}
> 1 Partition: 16.028212597888057
> 4 Partition: 16.14758460544976
> {noformat}
> What I want to achieve is that repeated computations of the KMeans clustering 
> yield identical results on identical training data, regardless of the 
> partitioning.
> Looking through the Spark source code, I guess the cause is the initialization 
> method of KMeans, which in turn uses the `takeSample` method, which does not 
> seem to be partition agnostic.
> Is this behaviour expected? Is there anything I could do to achieve 
> reproducible results?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121209#comment-16121209
 ] 

Vincent commented on SPARK-21688:
-

Currently, in certain places in ML/MLlib, such as the mllib SVM, the BLAS operations 
(dot, axpy, etc.) are bound to f2j, so there is no chance to use native BLAS. We 
understand the level-1 BLAS APIs were fixed to F2J for performance reasons, but that 
is mainly a multi-threaded native BLAS issue; with proper settings we won't be 
bothered by it. So maybe we should change the f2j-bound calls in the current 
implementation. [~srowen]
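
Just to make the shape of the proposal concrete, here is a minimal sketch of the idea, 
not the actual mllib BLAS code: route the level-1 calls through whatever implementation 
netlib-java resolves, which is a native library when one is installed and F2J otherwise. 
The object and method names are illustrative.

{code:java}
import com.github.fommil.netlib.{BLAS => NetlibBLAS, F2jBLAS}

object DispatchingBlas {
  // getInstance() picks a native implementation when available; fall back to F2J
  // explicitly if resolution fails for any reason.
  private lazy val impl: NetlibBLAS =
    try NetlibBLAS.getInstance() catch { case _: Throwable => new F2jBLAS }

  def dot(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length, "vectors must have the same length")
    impl.ddot(x.length, x, 1, y, 1)
  }

  def axpy(a: Double, x: Array[Double], y: Array[Double]): Unit = {
    require(x.length == y.length, "vectors must have the same length")
    impl.daxpy(x.length, a, x, 1, y, 1)
  }
}
{code}

Whether this wins for small vectors still depends on the threading settings of the 
native library, which is the concern discussed elsewhere in this thread.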

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: mllib svm training.png, svm1.png, svm2.png, 
> svm-mkl-1.png, svm-mkl-2.png
>
>
> In the current MLlib SVM implementation, we found that the CPU is not fully 
> utilized; one reason is that f2j BLAS is hard-wired into the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings 
> native BLAS is generally better than f2j at the unit-test level. Here we make 
> the BLAS operations in SVM go through MKL BLAS and present an end-to-end 
> performance report showing that in most cases native BLAS outperforms f2j 
> BLAS by up to 50%.
> So, we suggest removing the f2j-fixed calls and going with native BLAS if 
> available. If this proposal is acceptable, we will move on to benchmark the 
> other affected algorithms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21686) spark.sql.hive.convertMetastoreOrc is causing NullPointerException while reading ORC tables

2017-08-10 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121187#comment-16121187
 ] 

Liang-Chi Hsieh commented on SPARK-21686:
-

I see the affected version is 1.6.1. So more recent versions like 2.0 or 2.2 are not 
affected by this?

> spark.sql.hive.convertMetastoreOrc is causing NullPointerException while 
> reading ORC tables
> ---
>
> Key: SPARK-21686
> URL: https://issues.apache.org/jira/browse/SPARK-21686
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.1
> Environment: spark_2_4_2_0_258-1.6.1.2.4.2.0-258.el6.noarch 
> spark_2_4_2_0_258-python-1.6.1.2.4.2.0-258.el6.noarch 
> spark_2_4_2_0_258-yarn-shuffle-1.6.1.2.4.2.0-258.el6.noarch
> RHEL-7 (64-Bit)
> JDK 1.8
>Reporter: Ernani Pereira de Mattos Junior
>
> The issue is very similar to SPARK-10304: a Spark query throws a 
> NullPointerException. 
> >>> sqlContext.sql('select * from core_next.spark_categorization').show(57) 
> 17/06/19 11:26:54 ERROR Executor: Exception in task 2.0 in stage 21.0 (TID 
> 48) 
> java.lang.NullPointerException 
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:488)
>  
> at 
> org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:244)
>  
> at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$6.apply(OrcRelation.scala:275)
>  
> at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$6.apply(OrcRelation.scala:275)
>  
> Turning off ORC optimizations resolved the issue: 
> sqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21680) ML/MLLIB Vector compressed optimization

2017-08-10 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121175#comment-16121175
 ] 

Peng Meng commented on SPARK-21680:
---

I mean that if the user calls toSparse(size) but the size is smaller than numNonzeros, 
there may be a problem. 

> ML/MLLIB Vector compressed optimization
> ---
>
> Key: SPARK-21680
> URL: https://issues.apache.org/jira/browse/SPARK-21680
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> When Vector.compressed is used to convert a Vector to a SparseVector, the 
> performance is much lower than with Vector.toSparse.
> This is because Vector.compressed has to scan the values three times, while 
> Vector.toSparse only needs two passes.
> When the vector is long, there is a significant performance difference 
> between these two methods.
> Code of Vector.compressed:
> {code:java}
> def compressed: Vector = {
>   val nnz = numNonzeros
>   // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs
>   // 12 * nnz + 20 bytes.
>   if (1.5 * (nnz + 1.0) < size) {
>     toSparse
>   } else {
>     toDense
>   }
> }
> {code}
> I propose to change it to:
> {code:java}
> // Build the SparseVector directly from the nonzero entries, so the values are
> // scanned only once more after numNonzeros.
> def compressed: Vector = {
>   val nnz = numNonzeros
>   // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs
>   // 12 * nnz + 20 bytes.
>   if (1.5 * (nnz + 1.0) < size) {
>     val ii = new Array[Int](nnz)
>     val vv = new Array[Double](nnz)
>     var k = 0
>     foreachActive { (i, v) =>
>       if (v != 0) {
>         ii(k) = i
>         vv(k) = v
>         k += 1
>       }
>     }
>     new SparseVector(size, ii, vv)
>   } else {
>     toDense
>   }
> }
> {code}
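
A rough way to see the difference described above is a tiny timing sketch like the one 
below. The vector size, sparsity and the timing helper are illustrative only, and the 
absolute numbers will vary with the JVM and hardware.

{code:java}
import org.apache.spark.ml.linalg.Vectors

// Crude wall-clock timer; good enough to show the relative difference.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
  result
}

// A long, mostly sparse dense vector: 10 million entries, ~0.1% nonzero.
val n = 10000000
val dense = Vectors.dense(Array.tabulate(n)(i => if (i % 1000 == 0) 1.0 else 0.0))

time("toSparse")(dense.toSparse)
time("compressed")(dense.compressed)
{code}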



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21680) ML/MLLIB Vector compressed optimization

2017-08-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121166#comment-16121166
 ] 

Sean Owen commented on SPARK-21680:
---

I don't get what security issue you mean here, but no, the change you proposed 
initially is not a good solution. 

> ML/MLLIB Vector compressed optimization
> ---
>
> Key: SPARK-21680
> URL: https://issues.apache.org/jira/browse/SPARK-21680
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> When Vector.compressed is used to convert a Vector to a SparseVector, the 
> performance is much lower than with Vector.toSparse.
> This is because Vector.compressed has to scan the values three times, while 
> Vector.toSparse only needs two passes.
> When the vector is long, there is a significant performance difference 
> between these two methods.
> Code of Vector.compressed:
> {code:java}
> def compressed: Vector = {
>   val nnz = numNonzeros
>   // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs
>   // 12 * nnz + 20 bytes.
>   if (1.5 * (nnz + 1.0) < size) {
>     toSparse
>   } else {
>     toDense
>   }
> }
> {code}
> I propose to change it to:
> {code:java}
> // Build the SparseVector directly from the nonzero entries, so the values are
> // scanned only once more after numNonzeros.
> def compressed: Vector = {
>   val nnz = numNonzeros
>   // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs
>   // 12 * nnz + 20 bytes.
>   if (1.5 * (nnz + 1.0) < size) {
>     val ii = new Array[Int](nnz)
>     val vv = new Array[Double](nnz)
>     var k = 0
>     foreachActive { (i, v) =>
>       if (v != 0) {
>         ii(k) = i
>         vv(k) = v
>         k += 1
>       }
>     }
>     new SparseVector(size, ii, vv)
>   } else {
>     toDense
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121163#comment-16121163
 ] 

Sean Owen commented on SPARK-21688:
---

Of course native BLAS is typically faster where it is used, so you should enable it. 
What specifically are you saying beyond that?

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: mllib svm training.png, svm1.png, svm2.png, 
> svm-mkl-1.png, svm-mkl-2.png
>
>
> In the current MLlib SVM implementation, we found that the CPU is not fully 
> utilized; one reason is that f2j BLAS is hard-wired into the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings 
> native BLAS is generally better than f2j at the unit-test level. Here we make 
> the BLAS operations in SVM go through MKL BLAS and present an end-to-end 
> performance report showing that in most cases native BLAS outperforms f2j 
> BLAS by up to 50%.
> So, we suggest removing the f2j-fixed calls and going with native BLAS if 
> available. If this proposal is acceptable, we will move on to benchmark the 
> other affected algorithms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21689) Spark submit will not get kerberos token token when hbase class not found

2017-08-10 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121124#comment-16121124
 ] 

zhoukang commented on SPARK-21689:
--

I created a PR for this issue: https://github.com/apache/spark/pull/18901. I wonder 
why the PR was not linked to this issue automatically?

> Spark submit will not get kerberos token token when hbase class not found
> -
>
> Key: SPARK-21689
> URL: https://issues.apache.org/jira/browse/SPARK-21689
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.1.0, 2.2.0
>Reporter: zhoukang
>
> When using yarn cluster mode and we need to scan HBase, there is a case that 
> cannot work:
> if we put the user jar on HDFS, the local classpath has no HBase classes, so 
> fetching the HBase token fails. Later, when the job is submitted to YARN, it 
> fails because it has no token to access the HBase table. I mocked three cases:
> 1: user jar is on the classpath and includes HBase
> {code:java}
> 17/08/10 13:48:03 INFO security.HadoopFSDelegationTokenProvider: Renewal 
> interval is 86400050 for token HDFS_DELEGATION_TOKEN
> 17/08/10 13:48:03 INFO security.HadoopDelegationTokenManager: Service hive
> 17/08/10 13:48:03 INFO security.HadoopDelegationTokenManager: Service hbase
> 17/08/10 13:48:05 INFO security.HBaseDelegationTokenProvider: Attempting to 
> fetch HBase security token.
> {code}
> The logs show the token is obtained normally.
> 2: user jar on HDFS
> {code:java}
> 17/08/10 13:43:58 WARN security.HBaseDelegationTokenProvider: Class 
> org.apache.hadoop.hbase.HBaseConfiguration not found.
> 17/08/10 13:43:58 INFO security.HBaseDelegationTokenProvider: Failed to get 
> token from service hbase
> java.lang.ClassNotFoundException: 
> org.apache.hadoop.hbase.security.token.TokenUtil
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:41)
>   at 
> org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$obtainDelegationTokens$2.apply(HadoopDelegationTokenManager.scala:112)
>   at 
> org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$obtainDelegationTokens$2.apply(HadoopDelegationTokenManager.scala:109)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> {code}
> The logs show that fetching the token failed with a ClassNotFoundException.
> If we download the user jar from the remote location first, things work correctly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21660) Yarn ShuffleService failed to start when the chosen directory become read-only

2017-08-10 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121125#comment-16121125
 ] 

Saisai Shao commented on SPARK-21660:
-

Will the YARN NM handle this bad-disk problem and return a good disk for the 
recovery path? I guess YARN should handle this problem.
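
Independent of that, the check suggested in point 2 of the description below is cheap 
to do up front. A sketch of the idea only, not the actual YarnShuffleService code; the 
directory list here is illustrative and stands in for `yarn.nodemanager.local-dirs`.

{code:java}
import java.io.File
import java.nio.file.Files

// Pick the first local dir that exists and is actually writable before placing the DB there.
def firstWritableDir(localDirs: Seq[String]): Option[File] =
  localDirs.map(new File(_)).find(d => d.isDirectory && Files.isWritable(d.toPath))

val dbFile = firstWritableDir(Seq("/mnt/dfs/12/yarn/local", "/mnt/dfs/13/yarn/local"))
  .map(dir => new File(dir, "registeredExecutors.ldb"))
{code}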

> Yarn ShuffleService failed to start when the chosen directory become read-only
> --
>
> Key: SPARK-21660
> URL: https://issues.apache.org/jira/browse/SPARK-21660
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 2.1.1
>Reporter: lishuming
>
> h3. Background
> In our production environment, disks become read-only almost once a month. 
> The current strategy of the YARN ShuffleService for choosing an available 
> directory (disk) to store the shuffle info (DB) is as follows 
> (https://github.com/apache/spark/blob/master/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L340):
> 1. If the recoveryPath is not empty and the shuffle DB exists in the 
> recoveryPath, return the recoveryPath;
> 2. If the recoveryPath is empty and the shuffle DB exists in 
> `yarn.nodemanager.local-dirs`, set the recoveryPath to the existing DB path and 
> return that path;
> 3. If the recoveryPath is not empty (but the shuffle DB does not exist there) 
> and the shuffle DB exists in `yarn.nodemanager.local-dirs`, move the existing 
> shuffle DB to the recoveryPath and return that path;
> 4. If none of the above hit, choose the first disk of 
> `yarn.nodemanager.local-dirs` as the recoveryPath.
> None of the above considers whether the chosen disk (directory) is writable, 
> so in our environment we hit the following exception:
> {code:java}
> 2017-06-25 07:15:43,512 ERROR org.apache.spark.network.util.LevelDBProvider: 
> error opening leveldb file /mnt/dfs/12/yarn/local/registeredExecutors.ldb. 
> Creating new file, will not be able to recover state for existing applications
> at 
> org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:48)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.(ExternalShuffleBlockResolver.java:116)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.(ExternalShuffleBlockResolver.java:94)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.(ExternalShuffleBlockHandler.java:66)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:167)
> 2017-06-25 07:15:43,514 WARN org.apache.spark.network.util.LevelDBProvider: 
> error deleting /mnt/dfs/12/yarn/local/registeredExecutors.ldb
> 2017-06-25 07:15:43,515 INFO org.apache.hadoop.service.AbstractService: 
> Service spark_shuffle failed in state INITED; cause: java.io.IOException: 
> Unable to create state store
> at 
> org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:77)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.(ExternalShuffleBlockResolver.java:116)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.(ExternalShuffleBlockResolver.java:94)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.(ExternalShuffleBlockHandler.java:66)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:167)
> at 
> org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:75)
> {code}
> h3. Consideration
> 1. In many production environments, `yarn.nodemanager.local-dirs` has more 
> than one disk, so we could use a better selection strategy to avoid the 
> problem above;
> 2. Could we add a check that the chosen DB directory is writable, to avoid 
> the problem above?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121113#comment-16121113
 ] 

Vincent edited comment on SPARK-21688 at 8/10/17 6:13 AM:
--

Attached SVM profiling data and training comparison data for both the F2J and MKL 
solutions.


was (Author: vincexie):
profiling

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: mllib svm training.png, svm1.png, svm2.png, 
> svm-mkl-1.png, svm-mkl-2.png
>
>
> In the current MLlib SVM implementation, we found that the CPU is not fully 
> utilized; one reason is that f2j BLAS is hard-wired into the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings 
> native BLAS is generally better than f2j at the unit-test level. Here we make 
> the BLAS operations in SVM go through MKL BLAS and present an end-to-end 
> performance report showing that in most cases native BLAS outperforms f2j 
> BLAS by up to 50%.
> So, we suggest removing the f2j-fixed calls and going with native BLAS if 
> available. If this proposal is acceptable, we will move on to benchmark the 
> other affected algorithms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-10 Thread Vincent (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent updated SPARK-21688:

Attachment: svm1.png
svm2.png
svm-mkl-1.png
svm-mkl-2.png

profiling

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
> Attachments: mllib svm training.png, svm1.png, svm2.png, 
> svm-mkl-1.png, svm-mkl-2.png
>
>
> In the current MLlib SVM implementation, we found that the CPU is not fully 
> utilized; one reason is that f2j BLAS is hard-wired into the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper settings 
> native BLAS is generally better than f2j at the unit-test level. Here we make 
> the BLAS operations in SVM go through MKL BLAS and present an end-to-end 
> performance report showing that in most cases native BLAS outperforms f2j 
> BLAS by up to 50%.
> So, we suggest removing the f2j-fixed calls and going with native BLAS if 
> available. If this proposal is acceptable, we will move on to benchmark the 
> other affected algorithms.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21689) Spark submit will not get kerberos token token when hbase class not found

2017-08-10 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-21689:
-
Description: 
When using yarn cluster mode and we need to scan HBase, there is a case that 
cannot work:
if we put the user jar on HDFS, the local classpath has no HBase classes, so 
fetching the HBase token fails. Later, when the job is submitted to YARN, it 
fails because it has no token to access the HBase table. I mocked three cases:
1: user jar is on the classpath and includes HBase

{code:java}
17/08/10 13:48:03 INFO security.HadoopFSDelegationTokenProvider: Renewal 
interval is 86400050 for token HDFS_DELEGATION_TOKEN
17/08/10 13:48:03 INFO security.HadoopDelegationTokenManager: Service hive
17/08/10 13:48:03 INFO security.HadoopDelegationTokenManager: Service hbase
17/08/10 13:48:05 INFO security.HBaseDelegationTokenProvider: Attempting to 
fetch HBase security token.
{code}

The logs show the token is obtained normally.

2: user jar on HDFS

{code:java}
17/08/10 13:43:58 WARN security.HBaseDelegationTokenProvider: Class 
org.apache.hadoop.hbase.HBaseConfiguration not found.
17/08/10 13:43:58 INFO security.HBaseDelegationTokenProvider: Failed to get 
token from service hbase
java.lang.ClassNotFoundException: 
org.apache.hadoop.hbase.security.token.TokenUtil
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at 
org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:41)
at 
org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$obtainDelegationTokens$2.apply(HadoopDelegationTokenManager.scala:112)
at 
org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$obtainDelegationTokens$2.apply(HadoopDelegationTokenManager.scala:109)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
{code}

The logs show that fetching the token failed with a ClassNotFoundException.

If we download the user jar from the remote location first, things work correctly.



  was:
When use yarn cluster mode,and we need scan hbase,there will be a case which 
can not work:
If we put user jar on hdfs,when local classpath will has no hbase,which will 
let get hbase token failed.Then later when job submitted to yarn, it will 
failed since has no token to access hbase table.I mock three cases:
1:user jar is on classpath, and has hbase

{code:java}
17/08/10 13:48:03 INFO security.HadoopFSDelegationTokenProvider: Renewal 
interval is 86400050 for token HDFS_DELEGATION_TOKEN
17/08/10 13:48:03 INFO security.HadoopDelegationTokenManager: Service hive
17/08/10 13:48:03 INFO security.HadoopDelegationTokenManager: Service hbase
17/08/10 13:48:05 INFO security.HBaseDelegationTokenProvider: Attempting to 
fetch HBase security token.
{code}

2:user jar on hdfs

{code:java}
17/08/10 13:43:58 WARN security.HBaseDelegationTokenProvider: Class 
org.apache.hadoop.hbase.HBaseConfiguration not found.
17/08/10 13:43:58 INFO security.HBaseDelegationTokenProvider: Failed to get 
token from service hbase
java.lang.ClassNotFoundException: 
org.apache.hadoop.hbase.security.token.TokenUtil
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at 
org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:41)
at 
org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$obtainDelegationTokens$2.apply(HadoopDelegationTokenManager.scala:112)
at 
org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$obtainDelegationTokens$2.apply(HadoopDelegationTokenManager.scala:109)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
{code}

If we download user jar from remote first,then things will work correctly.




> Spark submit will not get kerberos token token when hbase class not found
> -
>
> Key: SPARK-21689
> URL: 

[jira] [Created] (SPARK-21690) one-pass imputer

2017-08-10 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-21690:


 Summary: one-pass imputer
 Key: SPARK-21690
 URL: https://issues.apache.org/jira/browse/SPARK-21690
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.1
Reporter: zhengruifeng


{code}
val surrogates = $(inputCols).map { inputCol =>
  val ic = col(inputCol)
  val filtered = dataset.select(ic.cast(DoubleType))
    .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN)
  if (filtered.take(1).length == 0) {
    throw new SparkException(s"surrogate cannot be computed. " +
      s"All the values in $inputCol are Null, Nan or missingValue(${$(missingValue)})")
  }
  val surrogate = $(strategy) match {
    case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first()
    case Imputer.median => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001).head
  }
  surrogate
}
{code}

The current implementation of {{Imputer}} processes one column after another, one job 
per column. We should process all columns in a single pass in a more efficient way.
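
For the mean strategy, a single aggregation job can compute every surrogate at once. 
A rough sketch of the idea, not a drop-in patch: the function name and the column 
handling below are illustrative, and error handling for all-missing columns is omitted.

{code:java}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, when}
import org.apache.spark.sql.types.DoubleType

def meanSurrogates(dataset: DataFrame,
                   inputCols: Seq[String],
                   missingValue: Double): Map[String, Double] = {
  val aggExprs = inputCols.map { c =>
    val ic = col(c).cast(DoubleType)
    // when(...) without otherwise yields null for missing entries, which avg() skips,
    // mirroring the isNotNull / =!= missingValue / !isNaN filter above.
    avg(when(!ic.isNaN && ic =!= missingValue, ic)).alias(c)
  }
  // One job computes the surrogate for every input column.
  val row = dataset.select(aggExprs: _*).head()
  inputCols.zipWithIndex.map { case (c, i) => c -> row.getDouble(i) }.toMap
}
{code}

The median strategy could likely be batched in a similar way, since recent versions of 
{{DataFrameStatFunctions.approxQuantile}} accept multiple columns in one call.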




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


