[jira] [Updated] (SPARK-11723) Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame

2015-11-13 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11723:

Summary: Use LibSVM data source rather than MLUtils.loadLibSVMFile to load 
DataFrame  (was: Use LibSVM data source rather than  to load DataFrame)

> Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame
> ---
>
> Key: SPARK-11723
> URL: https://issues.apache.org/jira/browse/SPARK-11723
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Examples, ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> We prefer to use sqlContext.read.format("libsvm").load(...) rather than 
> MLUtils.loadLibSVMFile(...) to load the data stored in LIBSVM format as a 
> DataFrame, so make this change for all example code under examples/ml and in 
> the user guide example code (only for ml, not mllib).
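
For reference, a minimal Scala sketch of the two loading styles (the sample path and the {{sc}}/{{sqlContext}} names are assumptions, not part of this issue):

{code}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`

// Preferred: the LibSVM data source returns a DataFrame directly.
val df = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Older style: load an RDD[LabeledPoint] with MLUtils, then convert it.
val points = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val df2 = sqlContext.createDataFrame(points)
{code}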






[jira] [Assigned] (SPARK-11723) Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11723:


Assignee: (was: Apache Spark)

> Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame
> ---
>
> Key: SPARK-11723
> URL: https://issues.apache.org/jira/browse/SPARK-11723
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Examples, ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> We prefer to use sqlContext.read.format("libsvm").load(...) rather than 
> MLUtils.loadLibSVMFile(...) to load the data stored in LIBSVM format as a 
> DataFrame, so make this change for all example code under examples/ml and in 
> the user guide example code (only for ml, not mllib).






[jira] [Commented] (SPARK-11638) Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker with Bridge networking

2015-11-13 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003875#comment-15003875
 ] 

Stavros Kontopoulos commented on SPARK-11638:
-

Ok, it makes sense. So executors could run in containers too, either as a predefined 
setup or by using spark.mesos.executor.docker.image on the driver side so they are 
launched simply on Mesos. Does each choice affect the test method in the same way? 
I am trying to cover the test cases...
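
If it helps frame the test cases, a minimal sketch of the second option (the master URL and image name are assumptions, not from this ticket):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Driver-side configuration asking Mesos to launch executors inside a Docker image.
val conf = new SparkConf()
  .setMaster("mesos://zk://zk1:2181/mesos")  // hypothetical Mesos master URL
  .set("spark.mesos.executor.docker.image", "example/spark-executor:1.5.2")  // hypothetical image
val sc = new SparkContext(conf)
{code}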

> Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker 
> with Bridge networking
> 
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0-master.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, 
> {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and 
> {{spark.replClassServer.advertisedPort}} settings to enable running Spark in 
> Mesos on Docker with Bridge networking. Provides patches for Akka Remote to 
> enable Spark driver advertisement using alternative host and port.
> With these settings, it is possible to run Spark Master in a Docker container 
> and have the executors running on Mesos talk back correctly to such Master.
> The problem is discussed on the Mesos mailing list here: 
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework to receive orders in the bridged container, Mesos 
> in the container has to register for offers using the IP address of the 
> Agent. Offers are sent by Mesos Master to the Docker container running on a 
> different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} 
> would advertise itself using the IP address of the container, something like 
> {{172.x.x.x}}. Obviously, Mesos Master can't reach that address, it's a 
> different host, it's a different machine. Mesos 0.24.0 introduced two new 
> properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and 
> {{LIBPROCESS_ADVERTISE_PORT}}. This allows the container to use the Agent's 
> address to register for offers. This was provided mainly for running Mesos in 
> Docker on Mesos.
> h4. Spark - how does the above relate and what is being addressed here?
> Similar to Mesos, out of the box, Spark does not allow advertising its 
> services on ports different from the bind ports. Consider the following scenario:
> Spark is running inside a Docker container on Mesos, in bridge networking 
> mode. Assume a port {{}} for the {{spark.driver.port}}, {{6677}} for 
> the {{spark.fileserver.port}}, {{6688}} for the {{spark.broadcast.port}} and 
> {{23456}} for the {{spark.replClassServer.port}}. If such a task is posted to 
> Marathon, Mesos will give 4 ports in the range {{31000-32000}} mapping to the 
> container ports. Starting the executors from such a container results in the 
> executors not being able to communicate back to the Spark Master.
> This happens because of 2 things:
> Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} 
> transport. {{akka-remote}} prior to version {{2.4}} can't advertise a port 
> different to what it bound to. The settings discussed are here: 
> https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376.
>  These do not exist in Akka {{2.3.x}}. Spark driver will always advertise 
> port {{}} as this is the one {{akka-remote}} is bound to.
> Any URIs the executors contact the Spark Master on, are prepared by Spark 
> Master and handed over to executors. These always contain the port number 
> used by the Master to find the service on. The services are:
> - {{spark.broadcast.port}}
> - {{spark.fileserver.port}}
> - {{spark.replClassServer.port}}
> All of the above ports are {{0}} by default (random assignment) but can be specified 
> using Spark configuration ( {{-Dspark...port}} ). However, they are limited 
> in the same way as the {{spark.driver.port}}; in the above example, an 
> executor should not contact the file server on port {{6677}} but rather on 
> the respective 31xxx assigned by Mesos.
> Spark currently does not allow any of that.
> h4. Taking on the problem, step 1: Spark Driver
> As mentioned above, Spark Driver is based on {{akka-remote}}. In order to 
> take on the problem, the {{akka.remote.net.tcp.bind-hostname}} and 
> 
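
To make the proposal concrete, a hedged sketch of how the advertised-port settings named in the summary might be wired up from a Marathon-launched driver. The bind-port values and the PORT0..PORT3 environment variables that Marathon maps are assumptions, and the advertisedPort keys only exist with this ticket's patches applied:

{code}
import org.apache.spark.SparkConf

// Bind to fixed in-container ports, advertise the host ports that
// Mesos/Marathon mapped to them (exposed to the task as PORT0..PORT3).
val conf = new SparkConf()
  .set("spark.driver.port", "7001")
  .set("spark.driver.advertisedPort", sys.env("PORT0"))
  .set("spark.fileserver.port", "6677")
  .set("spark.fileserver.advertisedPort", sys.env("PORT1"))
  .set("spark.broadcast.port", "6688")
  .set("spark.broadcast.advertisedPort", sys.env("PORT2"))
  .set("spark.replClassServer.port", "23456")
  .set("spark.replClassServer.advertisedPort", sys.env("PORT3"))
{code}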

[jira] [Commented] (SPARK-11617) MEMORY LEAK: ByteBuf.release() was not called before it's garbage-collected

2015-11-13 Thread Jacek Lewandowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003879#comment-15003879
 ] 

Jacek Lewandowski commented on SPARK-11617:
---

This also happens in standalone mode with the Netty-based RPC - I've seen it in the 
Master and Worker logs.


> MEMORY LEAK: ByteBuf.release() was not called before it's garbage-collected
> ---
>
> Key: SPARK-11617
> URL: https://issues.apache.org/jira/browse/SPARK-11617
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
>Reporter: LingZhou
>
> The problem may be related to
>  [SPARK-11235][NETWORK] Add ability to stream data using network lib.
> While running in yarn-client mode, there are error messages:
> 15/11/09 10:23:55 ERROR util.ResourceLeakDetector: LEAK: ByteBuf.release() 
> was not called before it's garbage-collected. Enable advanced leak reporting 
> to find out where the leak occurred. To enable advanced leak reporting, 
> specify the JVM option '-Dio.netty.leakDetectionLevel=advanced' or call 
> ResourceLeakDetector.setLevel() See 
> http://netty.io/wiki/reference-counted-objects.html for more information.
> and then it will cause 
> cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN 
> for exceeding memory limits. 9.0 GB of 9 GB physical memory used. Consider 
> boosting spark.yarn.executor.memoryOverhead.
> and WARN scheduler.TaskSetManager: Lost task 105.0 in stage 1.0 (TID 2616, 
> gsr489): java.lang.IndexOutOfBoundsException: index: 130828, length: 16833 
> (expected: range(0, 524288)).
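
As a side note, a minimal sketch of turning on the advanced leak reporting the message refers to, either via the JVM option above or programmatically:

{code}
import io.netty.util.ResourceLeakDetector

// Equivalent to passing -Dio.netty.leakDetectionLevel=advanced to the JVM;
// makes Netty record where leaked ByteBufs were last accessed.
ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.ADVANCED)
{code}

On YARN the JVM option would typically be passed through spark.driver.extraJavaOptions / spark.executor.extraJavaOptions rather than set in code.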






[jira] [Commented] (SPARK-11638) Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker with Bridge networking

2015-11-13 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003889#comment-15003889
 ] 

Stavros Kontopoulos commented on SPARK-11638:
-

Ok, so the question from my side is: if the IPs (to set up the advertised IPs) are 
injected somehow at start-up, does applying the fix make this common scenario work? 
I completely understand the intention of something limited as a proof of concept; 
what I am trying to find out is the extent of the use cases covered, and as a result 
the test scenarios and/or future work, if any.

> Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker 
> with Bridge networking
> 
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0-master.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, 
> {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and 
> {{spark.replClassServer.advertisedPort}} settings to enable running Spark in 
> Mesos on Docker with Bridge networking. Provides patches for Akka Remote to 
> enable Spark driver advertisement using alternative host and port.
> With these settings, it is possible to run Spark Master in a Docker container 
> and have the executors running on Mesos talk back correctly to such Master.
> The problem is discussed on the Mesos mailing list here: 
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework to receive orders in the bridged container, Mesos 
> in the container has to register for offers using the IP address of the 
> Agent. Offers are sent by Mesos Master to the Docker container running on a 
> different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} 
> would advertise itself using the IP address of the container, something like 
> {{172.x.x.x}}. Obviously, Mesos Master can't reach that address, it's a 
> different host, it's a different machine. Mesos 0.24.0 introduced two new 
> properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and 
> {{LIBPROCESS_ADVERTISE_PORT}}. This allows the container to use the Agent's 
> address to register for offers. This was provided mainly for running Mesos in 
> Docker on Mesos.
> h4. Spark - how does the above relate and what is being addressed here?
> Similar to Mesos, out of the box, Spark does not allow advertising its 
> services on ports different from the bind ports. Consider the following scenario:
> Spark is running inside a Docker container on Mesos, in bridge networking 
> mode. Assume a port {{}} for the {{spark.driver.port}}, {{6677}} for 
> the {{spark.fileserver.port}}, {{6688}} for the {{spark.broadcast.port}} and 
> {{23456}} for the {{spark.replClassServer.port}}. If such a task is posted to 
> Marathon, Mesos will give 4 ports in the range {{31000-32000}} mapping to the 
> container ports. Starting the executors from such a container results in the 
> executors not being able to communicate back to the Spark Master.
> This happens because of 2 things:
> Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} 
> transport. {{akka-remote}} prior to version {{2.4}} can't advertise a port 
> different to what it bound to. The settings discussed are here: 
> https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376.
>  These do not exist in Akka {{2.3.x}}. Spark driver will always advertise 
> port {{}} as this is the one {{akka-remote}} is bound to.
> Any URIs the executors contact the Spark Master on, are prepared by Spark 
> Master and handed over to executors. These always contain the port number 
> used by the Master to find the service on. The services are:
> - {{spark.broadcast.port}}
> - {{spark.fileserver.port}}
> - {{spark.replClassServer.port}}
> All of the above ports are {{0}} by default (random assignment) but can be specified 
> using Spark configuration ( {{-Dspark...port}} ). However, they are limited 
> in the same way as the {{spark.driver.port}}; in the above example, an 
> executor should not contact the file server on port {{6677}} but rather on 
> the respective 31xxx assigned by Mesos.
> Spark currently does not allow any of that.
> h4. Taking on the problem, step 1: Spark Driver
> As mentioned above, Spark Driver is based on 

[jira] [Commented] (SPARK-11638) Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker with Bridge networking

2015-11-13 Thread Radoslaw Gruchalski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003905#comment-15003905
 ] 

Radoslaw Gruchalski commented on SPARK-11638:
-

Exactly, the only "problematic" thing is how to get the IPs into the container. 
When submitting a task to Mesos/Marathon, you submit the task to the Mesos 
master, so at the time of submission you don't know where the task is going to 
run. When submitting a task to Marathon, this is what we do at Virdata (pseudo 
code):

- have a file called /etc/agent.sh, this file contains something like:

{noformat}
#!/bin/bash
AGENT_PRIVATE_IP=$(ifconfig ...)
{noformat}

When we submit the task to Marathon (we use Marathon), we do:

{noformat}
{
  ...
  "container": {
    "type": "DOCKER",
    "docker": ...,
    "volumes": [
      {
        "containerPath": "/etc/agent.sh",
        "hostPath": "/etc/agent.sh",
        "mode": "RO"
      }
    ]
  }
}
{noformat}

In the container, {{source /etc/agent.sh}}.

In case of the executors having to know the addresses of every agent (so they 
can resolve back to the master), the simplest way would be to generate a file 
like this:

{noformat}
# /etc/mesos-hosts
10.100.1.10  mesos-agent1
10.100.1.11  mesos-agent2
...
{noformat}

And store it on HDFS. As long as the executor container can read from HDFS, 
you'll be sorted. Again, I think an MVE would be much clearer than this write-up. 
Happy to provide such code, but it may be difficult today.

> Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker 
> with Bridge networking
> 
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0-master.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, 
> {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and 
> {{spark.replClassServer.advertisedPort}} settings to enable running Spark in 
> Mesos on Docker with Bridge networking. Provides patches for Akka Remote to 
> enable Spark driver advertisement using alternative host and port.
> With these settings, it is possible to run Spark Master in a Docker container 
> and have the executors running on Mesos talk back correctly to such Master.
> The problem is discussed on the Mesos mailing list here: 
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework to receive orders in the bridged container, Mesos 
> in the container has to register for offers using the IP address of the 
> Agent. Offers are sent by Mesos Master to the Docker container running on a 
> different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} 
> would advertise itself using the IP address of the container, something like 
> {{172.x.x.x}}. Obviously, Mesos Master can't reach that address, it's a 
> different host, it's a different machine. Mesos 0.24.0 introduced two new 
> properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and 
> {{LIBPROCESS_ADVERTISE_PORT}}. This allows the container to use the Agent's 
> address to register for offers. This was provided mainly for running Mesos in 
> Docker on Mesos.
> h4. Spark - how does the above relate and what is being addressed here?
> Similar to Mesos, out of the box, Spark does not allow advertising its 
> services on ports different from the bind ports. Consider the following scenario:
> Spark is running inside a Docker container on Mesos, in bridge networking 
> mode. Assume a port {{}} for the {{spark.driver.port}}, {{6677}} for 
> the {{spark.fileserver.port}}, {{6688}} for the {{spark.broadcast.port}} and 
> {{23456}} for the {{spark.replClassServer.port}}. If such a task is posted to 
> Marathon, Mesos will give 4 ports in the range {{31000-32000}} mapping to the 
> container ports. Starting the executors from such a container results in the 
> executors not being able to communicate back to the Spark Master.
> This happens because of 2 things:
> Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} 
> transport. {{akka-remote}} prior to version {{2.4}} can't advertise a port 
> different to what it bound to. The settings discussed are here: 
> https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376.
>  These do not exist in Akka {{2.3.x}}. Spark driver will always 

[jira] [Commented] (SPARK-11638) Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker with Bridge networking

2015-11-13 Thread Radoslaw Gruchalski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003878#comment-15003878
 ] 

Radoslaw Gruchalski commented on SPARK-11638:
-

Indeed, executors can run in Docker containers as well. There's a caveat, 
though: the executor container must be able to resolve the agent. For this, 
some way of injecting IP addresses into the container is required. This can be 
done but may not be trivial. I am looking at providing a minimum viable example 
to be added to this ticket.

> Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker 
> with Bridge networking
> 
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0-master.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, 
> {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and 
> {{spark.replClassServer.advertisedPort}} settings to enable running Spark in 
> Mesos on Docker with Bridge networking. Provides patches for Akka Remote to 
> enable Spark driver advertisement using alternative host and port.
> With these settings, it is possible to run Spark Master in a Docker container 
> and have the executors running on Mesos talk back correctly to such Master.
> The problem is discussed on the Mesos mailing list here: 
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework to receive orders in the bridged container, Mesos 
> in the container has to register for offers using the IP address of the 
> Agent. Offers are sent by Mesos Master to the Docker container running on a 
> different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} 
> would advertise itself using the IP address of the container, something like 
> {{172.x.x.x}}. Obviously, Mesos Master can't reach that address, it's a 
> different host, it's a different machine. Mesos 0.24.0 introduced two new 
> properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and 
> {{LIBPROCESS_ADVERTISE_PORT}}. This allows the container to use the Agent's 
> address to register for offers. This was provided mainly for running Mesos in 
> Docker on Mesos.
> h4. Spark - how does the above relate and what is being addressed here?
> Similar to Mesos, out of the box, Spark does not allow advertising its 
> services on ports different from the bind ports. Consider the following scenario:
> Spark is running inside a Docker container on Mesos, in bridge networking 
> mode. Assume a port {{}} for the {{spark.driver.port}}, {{6677}} for 
> the {{spark.fileserver.port}}, {{6688}} for the {{spark.broadcast.port}} and 
> {{23456}} for the {{spark.replClassServer.port}}. If such a task is posted to 
> Marathon, Mesos will give 4 ports in the range {{31000-32000}} mapping to the 
> container ports. Starting the executors from such a container results in the 
> executors not being able to communicate back to the Spark Master.
> This happens because of 2 things:
> Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} 
> transport. {{akka-remote}} prior to version {{2.4}} can't advertise a port 
> different to what it bound to. The settings discussed are here: 
> https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376.
>  These do not exist in Akka {{2.3.x}}. Spark driver will always advertise 
> port {{}} as this is the one {{akka-remote}} is bound to.
> Any URIs the executors contact the Spark Master on, are prepared by Spark 
> Master and handed over to executors. These always contain the port number 
> used by the Master to find the service on. The services are:
> - {{spark.broadcast.port}}
> - {{spark.fileserver.port}}
> - {{spark.replClassServer.port}}
> All of the above ports are {{0}} by default (random assignment) but can be specified 
> using Spark configuration ( {{-Dspark...port}} ). However, they are limited 
> in the same way as the {{spark.driver.port}}; in the above example, an 
> executor should not contact the file server on port {{6677}} but rather on 
> the respective 31xxx assigned by Mesos.
> Spark currently does not allow any of that.
> h4. Taking on the problem, step 1: Spark Driver
> As mentioned above, Spark Driver is based on {{akka-remote}}. In order to 
> take on the 

[jira] [Comment Edited] (SPARK-11638) Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker with Bridge networking

2015-11-13 Thread Radoslaw Gruchalski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003905#comment-15003905
 ] 

Radoslaw Gruchalski edited comment on SPARK-11638 at 11/13/15 12:08 PM:


Exactly, the only "problematic" thing is how to get the IPs into the container. 
When submitting a task to Mesos/Marathon, you submit the task to the Mesos 
master, so at the time of submission you don't know where the task is going to 
run. When submitting a task to Marathon, this is what we do at Virdata (pseudo 
code):

- have a file called /etc/agent.sh, this file contains something like:

{noformat}
#!/bin/bash
AGENT_PRIVATE_IP=$(ifconfig ...)
{noformat}

When we submit the task to Marathon (we use Marathon), we do:

{noformat}
{
  ...
  "container": {
    "type": "DOCKER",
    "docker": ...,
    "volumes": [
      {
        "containerPath": "/etc/agent.sh",
        "hostPath": "/etc/agent.sh",
        "mode": "RO"
      }
    ]
  }
}
{noformat}

In the container, {{source /etc/agent.sh}}.

The {{/etc/agent.sh}} file needs to exist on every agent node.

In case of the executors having to know the addresses of every agent (so they 
can resolve back to the master), the simplest way would be to generate a file 
like this:

{noformat}
# /etc/mesos-hosts
10.100.1.10  mesos-agent1
10.100.1.11  mesos-agent2
...
{noformat}

And store it on HDFS. As long as the executor container can read from HDFS, 
you'll be sorted. Again, I think an MVE would be much clearer than this write-up. 
Happy to provide such code, but it may be difficult today.


was (Author: radekg):
Exactly, the only "problematic" thing is how to get the ips into the container. 
When submitting a task to mesos/marathon, you submit the task to the mesos 
master, so at the time of submission you don't know where the task is going to 
run. When submitting a task to Marathon, this is what we do at Virdata (pseudo 
code):

- have a file called /etc/agent.sh, this file contains something like:

{noformat}
#!/bin/bash
AGENT_PRIVATE_IP=$(ifconfig ...)
{noformat}

When we submit the task to Marathon (we use Marathon), we do:

{noformat}
{
 ...
  "container": {
"type": "docker",
"docker": ...
  },
  "volumes": {
"containerPath": "/etc/agent.sh",
"hostPath": "/etc/agent.sh",
"mode": "RO"
  }
}
{noformat}

In the container, {{source /etc/agent.sh}}.

In case of the executors having to know the addresses of every agent (so they 
can resolve back to the master), the simplest way would be to generate a file 
like this:

{noformat}
# /etc/mesos-hosts
10.100.1.10mesos-agent1
10.100.1.11mesos-agent2
...
{noformat}

And store it on hdfs. As long as the executor container can read from hdfs, 
you'll be sorted. Again, I think an MVE would be much clearer than this write 
up. Happy to provide such code but it may be difficult today.

> Apache Spark in Docker with Bridge networking / run Spark on Mesos, in Docker 
> with Bridge networking
> 
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0-master.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, 
> {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and 
> {{spark.replClassServer.advertisedPort}} settings to enable running Spark in 
> Mesos on Docker with Bridge networking. Provides patches for Akka Remote to 
> enable Spark driver advertisement using alternative host and port.
> With these settings, it is possible to run Spark Master in a Docker container 
> and have the executors running on Mesos talk back correctly to such Master.
> The problem is discussed on the Mesos mailing list here: 
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework to receive orders in the bridged container, Mesos 
> in the container has to register for offers using the IP address of the 
> Agent. Offers are sent by Mesos Master to the Docker container running on a 
> different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} 
> would advertise itself using the IP address of the container, something like 
> {{172.x.x.x}}. Obviously, Mesos Master can't reach that address, it's a 
> different host, it's a different machine. Mesos 0.24.0 introduced two new 
> properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and 

[jira] [Commented] (SPARK-11721) The programming guide for Spark SQL in Spark 1.3.0 needs additional imports to work

2015-11-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003701#comment-15003701
 ] 

Sean Owen commented on SPARK-11721:
---

I don't think there will be any further 1.3.x releases, so I don't think it 
would be published. But is there a problem? This import is already in the 
listing:

{code}
// Import Spark SQL data types and Row.
import org.apache.spark.sql._
{code}

> The programming guide for Spark SQL in Spark 1.3.0 needs additional imports 
> to work
> ---
>
> Key: SPARK-11721
> URL: https://issues.apache.org/jira/browse/SPARK-11721
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.3.0
>Reporter: Neelesh Srinivas Salian
>Priority: Trivial
>
> The documentation in 
> http://spark.apache.org/docs/1.3.0/sql-programming-guide.html in the 
> Programmatically Specifying the Schema section needs to add a couple more 
> imports to get the example to run.
> Import statements for Row and sql.types.






[jira] [Resolved] (SPARK-11706) Streaming Python tests cannot report failures

2015-11-13 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-11706.
---
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 1.6.0

> Streaming Python tests cannot report failures
> -
>
> Key: SPARK-11706
> URL: https://issues.apache.org/jira/browse/SPARK-11706
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Tests
>Affects Versions: 1.5.2
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
>
> python/pyspark/streaming/tests.py doesn't check the test results. So it 
> always exits with 0.






[jira] [Assigned] (SPARK-11549) Replace example code in mllib-evaluation-metrics.md using include_example

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11549:


Assignee: Apache Spark

> Replace example code in mllib-evaluation-metrics.md using include_example
> -
>
> Key: SPARK-11549
> URL: https://issues.apache.org/jira/browse/SPARK-11549
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Apache Spark
>  Labels: starter
>







[jira] [Commented] (SPARK-11549) Replace example code in mllib-evaluation-metrics.md using include_example

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003789#comment-15003789
 ] 

Apache Spark commented on SPARK-11549:
--

User 'vikasnp' has created a pull request for this issue:
https://github.com/apache/spark/pull/9689

> Replace example code in mllib-evaluation-metrics.md using include_example
> -
>
> Key: SPARK-11549
> URL: https://issues.apache.org/jira/browse/SPARK-11549
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
>







[jira] [Assigned] (SPARK-11549) Replace example code in mllib-evaluation-metrics.md using include_example

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11549:


Assignee: (was: Apache Spark)

> Replace example code in mllib-evaluation-metrics.md using include_example
> -
>
> Key: SPARK-11549
> URL: https://issues.apache.org/jira/browse/SPARK-11549
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
>







[jira] [Commented] (SPARK-2960) Spark executables fail to start via symlinks

2015-11-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003697#comment-15003697
 ] 

Sean Owen commented on SPARK-2960:
--

I don't think that's a problem. If {{SPARK_HOME}} is explicitly set, it means 
it's intended to be respected as the primary installation. This is for 
production deployments, not development. I don't see a use case where you have 
two simultaneous production deployments on one environment.

> Spark executables fail to start via symlinks
> 
>
> Key: SPARK-2960
> URL: https://issues.apache.org/jira/browse/SPARK-2960
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Reporter: Shay Rojansky
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 1.6.0
>
>
> The current scripts (e.g. pyspark) fail to run when they are executed via 
> symlinks. A common Linux scenario would be to have Spark installed somewhere 
> (e.g. /opt) and have a symlink to it in /usr/bin.






[jira] [Closed] (SPARK-11721) The programming guide for Spark SQL in Spark 1.3.0 needs additional imports to work

2015-11-13 Thread Neelesh Srinivas Salian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neelesh Srinivas Salian closed SPARK-11721.
---
   Resolution: Implemented
Fix Version/s: 1.3.0

> The programming guide for Spark SQL in Spark 1.3.0 needs additional imports 
> to work
> ---
>
> Key: SPARK-11721
> URL: https://issues.apache.org/jira/browse/SPARK-11721
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.3.0
>Reporter: Neelesh Srinivas Salian
>Priority: Trivial
> Fix For: 1.3.0
>
>
> The documentation in 
> http://spark.apache.org/docs/1.3.0/sql-programming-guide.html in the 
> Programmatically Specifying the Schema section needs to add a couple more 
> imports to get the example to run.
> Import statements for Row and sql.types.






[jira] [Commented] (SPARK-11601) ML 1.6 QA: API: Binary incompatible changes

2015-11-13 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004216#comment-15004216
 ] 

Xiangrui Meng commented on SPARK-11601:
---

* LogisticAggregator is a package private class, or inside a package private 
object.
* Adding methods to a sealed trait should be a compatible change.
* LeastSquaresAggregator is also a package private class, or inside a package 
private object.

> ML 1.6 QA: API: Binary incompatible changes
> ---
>
> Key: SPARK-11601
> URL: https://issues.apache.org/jira/browse/SPARK-11601
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Tim Hunter
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, ping [~mengxr] for advice since he did it for 
> 1.5.






[jira] [Updated] (SPARK-11723) Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame

2015-11-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11723:
--
Target Version/s: 1.6.0

> Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame
> ---
>
> Key: SPARK-11723
> URL: https://issues.apache.org/jira/browse/SPARK-11723
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Examples, ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> We prefer to use sqlContext.read.format("libsvm").load(...) rather than 
> MLUtils.loadLibSVMFile(...) to load the data stored in LIBSVM format as a 
> DataFrame, so make this change for all example code under examples/ml and in 
> the user guide example code (only for ml, not mllib).






[jira] [Updated] (SPARK-11723) Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame

2015-11-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11723:
--
Assignee: Yanbo Liang

> Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame
> ---
>
> Key: SPARK-11723
> URL: https://issues.apache.org/jira/browse/SPARK-11723
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Examples, ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> We prefer to use sqlContext.read.format("libsvm").load(...) rather than 
> MLUtils.loadLibSVMFile(...) to load the data stored in LIBSVM format as a 
> DataFrame, so make this change for all example code under examples/ml and in 
> the user guide example code (only for ml, not mllib).






[jira] [Resolved] (SPARK-11723) Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame

2015-11-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11723.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9690
[https://github.com/apache/spark/pull/9690]

> Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame
> ---
>
> Key: SPARK-11723
> URL: https://issues.apache.org/jira/browse/SPARK-11723
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Examples, ML
>Reporter: Yanbo Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> We prefer to use sqlContext.read.format("libsvm").load(...) rather than 
> MLUtils.loadLibSVMFile(...) to load the data stored in LIBSVM format as a 
> DataFrame, so make this change for all example code under examples/ml and in 
> the user guide example code (only for ml, not mllib).






[jira] [Commented] (SPARK-11721) The programming guide for Spark SQL in Spark 1.3.0 needs additional imports to work

2015-11-13 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004265#comment-15004265
 ] 

Neelesh Srinivas Salian commented on SPARK-11721:
-

Didn't work when I tried. Explicitly needed the Row and Type imports.

I do see it in the code though. But it hasn't been published.

I'll close this as resolved.
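
For reference, the extra imports being asked for would be along these lines (per the 1.3.0 API):

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
{code}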

> The programming guide for Spark SQL in Spark 1.3.0 needs additional imports 
> to work
> ---
>
> Key: SPARK-11721
> URL: https://issues.apache.org/jira/browse/SPARK-11721
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.3.0
>Reporter: Neelesh Srinivas Salian
>Priority: Trivial
> Fix For: 1.3.0
>
>
> The documentation in 
> http://spark.apache.org/docs/1.3.0/sql-programming-guide.html in the 
> Programmatically Specifying the Schema section needs to add a couple more 
> imports to get the example to run.
> Import statements for Row and sql.types.






[jira] [Reopened] (SPARK-11672) Flaky test: ml.JavaDefaultReadWriteSuite

2015-11-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reopened SPARK-11672:
---

Saw another one:

https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3997/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=spark-test/testReport/junit/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/

> Flaky test: ml.JavaDefaultReadWriteSuite
> 
>
> Key: SPARK-11672
> URL: https://issues.apache.org/jira/browse/SPARK-11672
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
> Fix For: 1.6.0
>
>
> Saw several failures on Jenkins, e.g., 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2040/testReport/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/






[jira] [Commented] (SPARK-11672) Flaky test: ml.JavaDefaultReadWriteSuite

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004290#comment-15004290
 ] 

Apache Spark commented on SPARK-11672:
--

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/9694

> Flaky test: ml.JavaDefaultReadWriteSuite
> 
>
> Key: SPARK-11672
> URL: https://issues.apache.org/jira/browse/SPARK-11672
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
> Fix For: 1.6.0
>
>
> Saw several failures on Jenkins, e.g., 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2040/testReport/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/






[jira] [Commented] (SPARK-11721) The programming guide for Spark SQL in Spark 1.3.0 needs additional imports to work

2015-11-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004305#comment-15004305
 ] 

Sean Owen commented on SPARK-11721:
---

Hm, what do you mean it hasn't been published? I'm referring to the published docs 
for 1.3.0, same as you.
{{Row}} is in this package, so it should be imported by this statement, no? If 
it's not, then something else may be wrong.

> The programming guide for Spark SQL in Spark 1.3.0 needs additional imports 
> to work
> ---
>
> Key: SPARK-11721
> URL: https://issues.apache.org/jira/browse/SPARK-11721
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.3.0
>Reporter: Neelesh Srinivas Salian
>Priority: Trivial
> Fix For: 1.3.0
>
>
> The documentation in 
> http://spark.apache.org/docs/1.3.0/sql-programming-guide.html in the 
> Programmatically Specifying the Schema section needs to add a couple more 
> imports to get the example to run.
> Import statements for Row and sql.types.






[jira] [Updated] (SPARK-11678) Partition discovery fail if there is a _SUCCESS file in the table's root dir

2015-11-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11678:
-
Fix Version/s: (was: 1.7.0)

> Partition discovery fail if there is a _SUCCESS file in the table's root dir
> 
>
> Key: SPARK-11678
> URL: https://issues.apache.org/jira/browse/SPARK-11678
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>  Labels: releasenotes
> Fix For: 1.6.0
>
>







[jira] [Created] (SPARK-11728) Replace example code in ml-ensembles.md using include_example

2015-11-13 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-11728:
-

 Summary: Replace example code in ml-ensembles.md using 
include_example
 Key: SPARK-11728
 URL: https://issues.apache.org/jira/browse/SPARK-11728
 Project: Spark
  Issue Type: Sub-task
Reporter: Xusen Yin









[jira] [Commented] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0

2015-11-13 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004206#comment-15004206
 ] 

Xiangrui Meng commented on SPARK-11720:
---

Computing average in the normal way should be sufficient. The precision issue 
is different from moment aggregation. Using m_{k+1} = m_{k} + delta/n won't 
help. We need https://en.wikipedia.org/wiki/Kahan_summation_algorithm or an 
equivalent algorithm to be very accurate, but it seems unnecessary in Spark.
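
For reference, a minimal Scala sketch of the compensated summation the link describes (illustration only, not something proposed for Spark here):

{code}
// Kahan (compensated) summation: carry the rounding error forward explicitly.
def kahanSum(xs: Seq[Double]): Double = {
  var sum = 0.0
  var c = 0.0            // running compensation for lost low-order bits
  for (x <- xs) {
    val y = x - c        // apply the compensation to the next term
    val t = sum + y      // low-order bits of y may be lost here
    c = (t - sum) - y    // recover the lost bits into the compensation
    sum = t
  }
  sum
}
{code}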

> Return Double.NaN instead of null for Mean and Average when count = 0
> -
>
> Key: SPARK-11720
> URL: https://issues.apache.org/jira/browse/SPARK-11720
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jihong MA
>
> Change the default behavior of mean in the case of count = 0 from null to 
> Double.NaN, to make it in line with all other univariate stats functions.
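
A hypothetical sketch of what the proposed change amounts to for the final value:

{code}
// Hypothetical: the value produced by Mean/Average once all rows are aggregated.
def mean(sum: Double, count: Long): Double =
  if (count == 0L) Double.NaN else sum / count
{code}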






[jira] [Comment Edited] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0

2015-11-13 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004206#comment-15004206
 ] 

Xiangrui Meng edited comment on SPARK-11720 at 11/13/15 4:25 PM:
-

Computing average in the normal way should be sufficient. The precision issue 
is different from moment aggregation. Using m' = m + delta/n won't help. We 
need https://en.wikipedia.org/wiki/Kahan_summation_algorithm or an equivalent 
algorithm to be very accurate, but it seems unnecessary in Spark.


was (Author: mengxr):
Computing average in the normal way should be sufficient. The precision issue 
is different from moment aggregation. Using m_{k+1} = m_{k} + delta/n won't 
help. We need https://en.wikipedia.org/wiki/Kahan_summation_algorithm or an 
equivalent algorithm to be very accurate, but it seems unnecessary in Spark.

> Return Double.NaN instead of null for Mean and Average when count = 0
> -
>
> Key: SPARK-11720
> URL: https://issues.apache.org/jira/browse/SPARK-11720
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jihong MA
>
> Change the default behavior of mean in the case of count = 0 from null to 
> Double.NaN, to make it in line with all other univariate stats functions.






[jira] [Commented] (SPARK-11668) R style summary stats in GLM package SparkR

2015-11-13 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004228#comment-15004228
 ] 

Xiangrui Meng commented on SPARK-11668:
---

[~shubhanshumis...@gmail.com] [~yanboliang] implemented std err, t score, and 
p-value in SPARK-11494. We definitely need to add more summary stats, but this 
JIRA is too broad, so I'm going to close this one; please create JIRAs for 
concrete summary statistics (and group them properly). Thanks!

> R style summary stats in GLM package SparkR
> ---
>
> Key: SPARK-11668
> URL: https://issues.apache.org/jira/browse/SPARK-11668
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.0, 1.5.1
> Environment: LINUX
> WINDOWS
> MAC
>Reporter: Shubhanshu Mishra
>Priority: Minor
>  Labels: GLM, sparkr
>
> In the current GLM module in SparkR, the `summary(model)` function call only 
> returns the values of the coefficients; however, the actual R GLM module 
> also returns the std. err, z score, p-value and confidence intervals 
> for the coefficients, as well as some model-based statistics like R-squared 
> values, AIC, BIC, etc. 
> Another inspiration for adding these metrics is the format of the Python 
> statsmodels package, described here: 
> http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/formulas.html






[jira] [Commented] (SPARK-11727) split ExpressionEncoder into FlatEncoder and ProductEncoder

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004092#comment-15004092
 ] 

Apache Spark commented on SPARK-11727:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9693

> split ExpressionEncoder into FlatEncoder and ProductEncoder
> ---
>
> Key: SPARK-11727
> URL: https://issues.apache.org/jira/browse/SPARK-11727
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11727) split ExpressionEncoder into FlatEncoder and ProductEncoder

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11727:


Assignee: Apache Spark

> split ExpressionEncoder into FlatEncoder and ProductEncoder
> ---
>
> Key: SPARK-11727
> URL: https://issues.apache.org/jira/browse/SPARK-11727
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-11727) split ExpressionEncoder into FlatEncoder and ProductEncoder

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11727:


Assignee: (was: Apache Spark)

> split ExpressionEncoder into FlatEncoder and ProductEncoder
> ---
>
> Key: SPARK-11727
> URL: https://issues.apache.org/jira/browse/SPARK-11727
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Updated] (SPARK-9647) MLlib + SparkR integration for 1.6

2015-11-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9647:
-
Description: 
This is an umbrella JIRA for MLlib + SparkR integration for Spark 1.6, 
continuing the work from SPARK-6805. Some suggested features:

1. model coefficients for logistic regression
2. support feature interactions in RFormula
3. summary statistics -> implemented for linear regression
4. more error families -> 1.7

  was:
This is an umbrella JIRA for MLlib + SparkR integration for Spark 1.6, 
continuing the work from SPARK-6805. Some suggested features:

1. model coefficients for logistic regression
2. support feature interactions in RFormula
3. summary statistics
4. more error families

A more detailed list is TBA.


> MLlib + SparkR integration for 1.6
> --
>
> Key: SPARK-9647
> URL: https://issues.apache.org/jira/browse/SPARK-9647
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> This is an umbrella JIRA for MLlib + SparkR integration for Spark 1.6, 
> continuing the work from SPARK-6805. Some suggested features:
> 1. model coefficients for logistic regression
> 2. support feature interactions in RFormula
> 3. summary statistics -> implemented for linear regression
> 4. more error families -> 1.7






[jira] [Commented] (SPARK-10673) spark.sql.hive.verifyPartitionPath Attempts to Verify Unregistered Partitions

2015-11-13 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004240#comment-15004240
 ] 

Xin Wu commented on SPARK-10673:


I will look into this one. 

> spark.sql.hive.verifyPartitionPath Attempts to Verify Unregistered Partitions
> -
>
> Key: SPARK-10673
> URL: https://issues.apache.org/jira/browse/SPARK-10673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Miklos Christine
>Priority: Minor
>
> In Spark 1.4, spark.sql.hive.verifyPartitionPath was set to true by default. 
> In Spark 1.5, it is now set to false by default. 
> If a table has a lot of partitions in the underlying filesystem, the code 
> unnecessarily checks for all the underlying directories when executing a 
> query. 
> https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L162
> Structure:
> {code}
> /user/hive/warehouse/table1/year=2015/month=01/
> /user/hive/warehouse/table1/year=2015/month=02/
> /user/hive/warehouse/table1/year=2015/month=03/
> ...
> /user/hive/warehouse/table1/year=2014/month=01/
> /user/hive/warehouse/table1/year=2014/month=02/
> {code}
> If the registered partitions only contain year=2015 when you run "show 
> partitions table1", this code path checks for all directories under the 
> table's root directory. This incurs a significant performance penalty if 
> there are a lot of partition directories. 
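
For readers who hit this, a minimal sketch of pinning the setting explicitly (table and column names are placeholders); the flag name is the one discussed in this issue:

{code}
// Explicitly disable the per-query partition path verification discussed above;
// with it off, Spark no longer lists every directory under the table root.
sqlContext.setConf("spark.sql.hive.verifyPartitionPath", "false")
sqlContext.sql("SELECT count(*) FROM table1 WHERE year = 2015").show()
{code}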



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11721) The programming guide for Spark SQL in Spark 1.3.0 needs additional imports to work

2015-11-13 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004265#comment-15004265
 ] 

Neelesh Srinivas Salian edited comment on SPARK-11721 at 11/13/15 4:48 PM:
---

Didn't work when I tried. Explicitly needed the Row and Type imports.

I do see it in the code though. But it hasn't been published.

I'll close this as resolved.

Thanks [~srowen]


was (Author: neelesh77):
Didn't work when I tried. Explicitly needed the Row and Type imports.

I do see it in the code though. But it hasn't been published.

I'll close this as resolved.

> The programming guide for Spark SQL in Spark 1.3.0 needs additional imports 
> to work
> ---
>
> Key: SPARK-11721
> URL: https://issues.apache.org/jira/browse/SPARK-11721
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.3.0
>Reporter: Neelesh Srinivas Salian
>Priority: Trivial
> Fix For: 1.3.0
>
>
> The documentation in 
> http://spark.apache.org/docs/1.3.0/sql-programming-guide.html in the 
> Programmatically Specifying the Schema section needs a couple more 
> imports to get the example to run.
> Import statements for Row and sql.types.
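
For anyone hitting the same issue, the missing imports are presumably along these lines (the selective form may differ from what the published guide eventually uses):

{code}
// The two additional imports the 1.3.0 "Programmatically Specifying the Schema"
// Scala example needs, per the description above.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
{code}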



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11727) split ExpressionEncoder into FlatEncoder and ProductEncoder

2015-11-13 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-11727:
---

 Summary: split ExpressionEncoder into FlatEncoder and 
ProductEncoder
 Key: SPARK-11727
 URL: https://issues.apache.org/jira/browse/SPARK-11727
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0

2015-11-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11720:
--
Target Version/s: 1.6.0

> Return Double.NaN instead of null for Mean and Average when count = 0
> -
>
> Key: SPARK-11720
> URL: https://issues.apache.org/jira/browse/SPARK-11720
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jihong MA
>
> change the default behavior of mean in the case of count = 0 from null to 
> Double.NaN, to make it in line with all other univariate stats functions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0

2015-11-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11720:
--
Priority: Minor  (was: Major)

> Return Double.NaN instead of null for Mean and Average when count = 0
> -
>
> Key: SPARK-11720
> URL: https://issues.apache.org/jira/browse/SPARK-11720
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jihong MA
>Priority: Minor
>
> change the default behavior of mean in the case of count = 0 from null to 
> Double.NaN, to make it in line with all other univariate stats functions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11445) Replace example code in mllib-ensembles.md using include_example

2015-11-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11445.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9407
[https://github.com/apache/spark/pull/9407]

> Replace example code in mllib-ensembles.md using include_example
> 
>
> Key: SPARK-11445
> URL: https://issues.apache.org/jira/browse/SPARK-11445
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Gabor Liptak
>  Labels: starter
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11678) Partition discovery fail if there is a _SUCCESS file in the table's root dir

2015-11-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11678:
-
Labels: releasenotes  (was: )

> Partition discovery fail if there is a _SUCCESS file in the table's root dir
> 
>
> Key: SPARK-11678
> URL: https://issues.apache.org/jira/browse/SPARK-11678
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>  Labels: releasenotes
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11678) Partition discovery fail if there is a _SUCCESS file in the table's root dir

2015-11-13 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004316#comment-15004316
 ] 

Yin Huai commented on SPARK-11678:
--

We need to document the newly added {{basePath}} option.
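
For the docs, a minimal sketch of how the option could be shown, assuming it behaves as named (paths are placeholders):

{code}
// With basePath pointing at the table root, partition discovery treats the
// directories below it as partitions and ignores stray files such as _SUCCESS.
val df = sqlContext.read
  .option("basePath", "/user/hive/warehouse/table1")
  .parquet("/user/hive/warehouse/table1/year=2015/month=01")
{code}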

> Partition discovery fail if there is a _SUCCESS file in the table's root dir
> 
>
> Key: SPARK-11678
> URL: https://issues.apache.org/jira/browse/SPARK-11678
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>  Labels: releasenotes
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11729) Replace example code in ml-linear-methods.md and ml-ann.md using include_example

2015-11-13 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-11729:
-

 Summary: Replace example code in ml-linear-methods.md and 
ml-ann.md using include_example
 Key: SPARK-11729
 URL: https://issues.apache.org/jira/browse/SPARK-11729
 Project: Spark
  Issue Type: Sub-task
Reporter: Xusen Yin


Process these two markdown files in one JIRA issue because they have fewer code 
changes than other files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0

2015-11-13 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004210#comment-15004210
 ] 

Xiangrui Meng commented on SPARK-11720:
---

If we don't have Decimal.NaN implemented, it is okay to return null. I don't 
think we want to promise consistency in future releases. Let's try the best we 
can in 1.6 and see feedback from users.

> Return Double.NaN instead of null for Mean and Average when count = 0
> -
>
> Key: SPARK-11720
> URL: https://issues.apache.org/jira/browse/SPARK-11720
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jihong MA
>
> change the default behavior of mean in the case of count = 0 from null to 
> Double.NaN, to make it in line with all other univariate stats functions. 
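
A hedged illustration of the behavior in question (df is any DataFrame with a numeric column x):

{code}
import org.apache.spark.sql.functions.avg

// An average over zero rows: count = 0, so today the result is null; the proposal
// is to return Double.NaN instead, matching the other univariate statistics.
df.filter("1 = 0").agg(avg("x")).show()
{code}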



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11445) Replace example code in mllib-ensembles.md using include_example

2015-11-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11445:
--
Assignee: Rishabh Bhardwaj

> Replace example code in mllib-ensembles.md using include_example
> 
>
> Key: SPARK-11445
> URL: https://issues.apache.org/jira/browse/SPARK-11445
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Gabor Liptak
>Assignee: Rishabh Bhardwaj
>  Labels: starter
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11693) spark kafka direct streaming exception

2015-11-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004072#comment-15004072
 ] 

Cody Koeninger commented on SPARK-11693:


You've under-provisioned Kafka storage and / or Spark compute capacity.
The result is that data is being deleted before it has been processed.
I personally think the proper response to a system being broken is for it to 
obviously break in a noticeable way, rather than silently giving the wrong 
result.
My recommended way to handle this would be to monitor your stream, and have a 
restart policy that's appropriate for your situation.

If you want to modify the area of the code you noted to silently catch the 
exception and start at the next available offset, you can do so pretty 
straightforwardly (streaming-kafka is an external module so you shouldn't have 
to re-deploy all of spark).  I don't think that's a modification that makes 
sense for the general use case however.
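
A minimal sketch of the monitor-and-restart approach suggested above; names, the batch interval, and the restart policy are placeholders, not a recommended production setup:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// createContext is a placeholder; the Kafka direct stream and the processing
// logic for the real job would be built inside it.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("direct-stream-with-restart")
  val ssc = new StreamingContext(conf, Seconds(10))
  // ... KafkaUtils.createDirectStream(...) and output operations go here ...
  ssc
}

var keepRunning = true
while (keepRunning) {
  val ssc = createContext()
  ssc.start()
  try {
    ssc.awaitTermination()   // rethrows fatal job errors, e.g. OffsetOutOfRangeException
    keepRunning = false      // clean shutdown was requested elsewhere
  } catch {
    case e: Throwable =>
      // alert/log here, then decide whether restarting (and thereby skipping the
      // already-deleted offsets) is acceptable for this application
      ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
{code}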

> spark kafka direct streaming exception
> --
>
> Key: SPARK-11693
> URL: https://issues.apache.org/jira/browse/SPARK-11693
> Project: Spark
>  Issue Type: Question
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: xiaoxiaoluo
>Priority: Minor
>
> We are using Spark Kafka direct streaming in our test environment. We have 
> limited the Kafka partition size to avoid exhausting the disk space. So when 
> data is written to Kafka faster than Spark Streaming can read it, an 
> exception is raised in Spark Streaming and the application is shut down.
> {noformat}
> 15/11/11 10:17:35 ERROR Executor: Exception in task 0.3 in stage 1626659.0 
> (TID 1134180)
> kafka.common.OffsetOutOfRangeException
>   at sun.reflect.GeneratedConstructorAccessor32.newInstance(Unknown 
> Source)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at java.lang.Class.newInstance(Class.java:442)
>   at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:86)
>   at 
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.handleFetchErr(KafkaRDD.scala:184)
>   at 
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193)
>   at 
> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 15/11/11 10:17:42 ERROR CoarseGrainedExecutorBackend: Driver 10.1.92.44:49939 
> disassociated! Shutting down.
> {noformat}
> Could streaming get the current smallest offset from this partition and go 
> on processing the streaming data?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9647) MLlib + SparkR integration for 1.6

2015-11-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-9647.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Marked this umbrella as done. We will track follow-up work in a separate JIRA.

> MLlib + SparkR integration for 1.6
> --
>
> Key: SPARK-9647
> URL: https://issues.apache.org/jira/browse/SPARK-9647
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
> Fix For: 1.6.0
>
>
> This is an umbrella JIRA for MLlib + SparkR integration for Spark 1.6, 
> continuing the work from SPARK-6805. Some suggested features:
> 1. model coefficients for logistic regression
> 2. support feature interactions in RFormula
> 3. summary statistics -> implemented for linear regression
> 4. more error families -> 1.7



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11668) R style summary stats in GLM package SparkR

2015-11-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11668.
---
Resolution: Duplicate

> R style summary stats in GLM package SparkR
> ---
>
> Key: SPARK-11668
> URL: https://issues.apache.org/jira/browse/SPARK-11668
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.0, 1.5.1
> Environment: LINUX
> WINDOWS
> MAC
>Reporter: Shubhanshu Mishra
>Priority: Minor
>  Labels: GLM, sparkr
>
> In the current SparkR GLM module, the `summary(model)` function call only 
> returns the values of the coefficients; however, in R's own GLM module the 
> function also returns the std. error, z score, p-value and confidence intervals 
> for the coefficients, as well as some model-based statistics like R-squared 
> values, AIC, BIC, etc. 
> Another inspiration for adding these metrics could be the format of the 
> Python statsmodels package described here: 
> http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/formulas.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8029) ShuffleMapTasks must be robust to concurrent attempts on the same executor

2015-11-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004527#comment-15004527
 ] 

Reynold Xin commented on SPARK-8029:


[~davies]  can you update the jira ticket description with the high level 
approach used in the fix?

> ShuffleMapTasks must be robust to concurrent attempts on the same executor
> --
>
> Key: SPARK-8029
> URL: https://issues.apache.org/jira/browse/SPARK-8029
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Imran Rashid
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.6.0
>
> Attachments: 
> AlternativesforMakingShuffleMapTasksRobusttoMultipleAttempts.pdf
>
>
> When stages get retried, a task may have more than one attempt running at the 
> same time, on the same executor.  Currently this causes problems for 
> ShuffleMapTasks, since all attempts try to write to the same output files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11724) Casting integer types to timestamp has unexpected semantics

2015-11-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-11724:

Labels: releasenotes  (was: )

> Casting integer types to timestamp has unexpected semantics
> ---
>
> Key: SPARK-11724
> URL: https://issues.apache.org/jira/browse/SPARK-11724
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Nong Li
>Priority: Minor
>  Labels: releasenotes
>
> Casting from integer types to timestamp treats the source int as being in 
> millis. Casting from timestamp to integer types creates the result in 
> seconds. This leads to behavior like:
> {code}
> scala> sql("select cast(cast (1234 as timestamp) as bigint)").show
> +---+
> |_c0|
> +---+
> |  1|
> +---+
> {code}
> Doubles, on the other hand, treat it as seconds when casting to and from:
> {code}
> scala> sql("select cast(cast (1234.5 as timestamp) as double)").show
> +--+
> |   _c0|
> +--+
> |1234.5|
> +--+
> {code}
> This also breaks some other functions which return long in seconds, in 
> particular, unix_timestamp.
> {code}
> scala> sql("select cast(unix_timestamp() as timestamp)").show
> +--------------------+
> |                 _c0|
> +--------------------+
> |1970-01-17 10:03:...|
> +--------------------+
> scala> sql("select cast(unix_timestamp() *1000 as timestamp)").show
> +--------------------+
> |                 _c0|
> +--------------------+
> |2015-11-12 23:26:...|
> +--------------------+
> {code}
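
Until the semantics are reconciled, a hedged workaround sketch, taking the millis-in / seconds-out behavior exactly as described in this report (spark-shell):

{code}
// int -> timestamp reads milliseconds, timestamp -> bigint writes seconds (per this
// report), so scaling by 1000 on the way in makes the round trip return 1234.
scala> sql("select cast(cast(1234 * 1000 as timestamp) as bigint)").show
{code}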



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11727) split ExpressionEncoder into FlatEncoder and ProductEncoder

2015-11-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11727.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9693
[https://github.com/apache/spark/pull/9693]

> split ExpressionEncoder into FlatEncoder and ProductEncoder
> ---
>
> Key: SPARK-11727
> URL: https://issues.apache.org/jira/browse/SPARK-11727
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2344:
---

Assignee: Apache Spark

> Add Fuzzy C-Means algorithm to MLlib
> 
>
> Key: SPARK-2344
> URL: https://issues.apache.org/jira/browse/SPARK-2344
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Alex
>Assignee: Apache Spark
>Priority: Minor
>  Labels: clustering
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib.
> FCM is very similar to K-Means, which is already implemented, and they 
> differ only in the degree of relationship each point has with each cluster 
> (in FCM the relationship is in the range [0..1], whereas in K-Means it is 0/1).
> As part of the implementation I would like to:
> - create a base class for K-Means and FCM
> - implement the relationship for each algorithm differently (in its class)
> I'd like this to be assigned to me.
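
Purely for illustration (this is not code from the proposed implementation), a sketch of the fuzzy membership computation that distinguishes FCM from K-Means, assuming the standard formulation with fuzziness exponent m and strictly positive distances:

{code}
// Membership of one point in each cluster: u_j = 1 / sum_k (d_j / d_k)^(2/(m-1)),
// where dists(j) is the distance from the point to cluster center j.
def memberships(dists: Array[Double], m: Double = 2.0): Array[Double] = {
  val p = 2.0 / (m - 1.0)
  dists.map { dij =>
    1.0 / dists.map(dkj => math.pow(dij / dkj, p)).sum
  }
}

// Example: a point equidistant from two centers gets membership 0.5 in each.
println(memberships(Array(1.0, 1.0)).mkString(", "))   // 0.5, 0.5
{code}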



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11734) Move reference sort into test and standardize on TungstenSort

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11734:


Assignee: Apache Spark  (was: Reynold Xin)

> Move reference sort into test and standardize on TungstenSort
> -
>
> Key: SPARK-11734
> URL: https://issues.apache.org/jira/browse/SPARK-11734
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11734) Move reference sort into test and standardize on TungstenSort

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004597#comment-15004597
 ] 

Apache Spark commented on SPARK-11734:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9700

> Move reference sort into test and standardize on TungstenSort
> -
>
> Key: SPARK-11734
> URL: https://issues.apache.org/jira/browse/SPARK-11734
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11734) Move reference sort into test and standardize on TungstenSort

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11734:


Assignee: Reynold Xin  (was: Apache Spark)

> Move reference sort into test and standardize on TungstenSort
> -
>
> Key: SPARK-11734
> URL: https://issues.apache.org/jira/browse/SPARK-11734
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6990) Add Java linting script

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004374#comment-15004374
 ] 

Apache Spark commented on SPARK-6990:
-

User 'dskrvk' has created a pull request for this issue:
https://github.com/apache/spark/pull/9696

> Add Java linting script
> ---
>
> Key: SPARK-6990
> URL: https://issues.apache.org/jira/browse/SPARK-6990
> Project: Spark
>  Issue Type: New Feature
>  Components: Project Infra
>Reporter: Josh Rosen
>Priority: Minor
>  Labels: starter
>
> It would be nice to add a {{dev/lint-java}} script to enforce style rules for 
> Spark's Java code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11601) ML 1.6 QA: API: Binary incompatible changes

2015-11-13 Thread Tim Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004419#comment-15004419
 ] 

Tim Hunter commented on SPARK-11601:


Oh I see, two of them are false positives (SPARK-11732) and I have a fix for 
them. The last one is a false positive from the MiMa tool itself that does not 
check that the interface {{LogisticRegressionSummary}} is sealed, and therefore 
complains about adding new methods to it.

> ML 1.6 QA: API: Binary incompatible changes
> ---
>
> Key: SPARK-11601
> URL: https://issues.apache.org/jira/browse/SPARK-11601
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Tim Hunter
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, ping [~mengxr] for advice since he did it for 
> 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10863) Method coltypes() to return the R column types of a DataFrame

2015-11-13 Thread Oscar D. Lara Yejas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004363#comment-15004363
 ] 

Oscar D. Lara Yejas commented on SPARK-10863:
-

[~felixcheung] Let me try to clarify a bit.

As suggested by [~shivaram], I implemented a fallback mechanism so that if 
there's no corresponding mapping from a Spark type into R's (i.e., mapping is 
NA), the same R type is returned.

The reason for this is that, in my opinion, having coltypes(df) return NA's 
would be a bit confusing from the user perspective. What would an NA type mean? 
Type not set or data inconsistency come to my mind if I were in the user's 
shoes.

I believe it all depends on the type of operations we want to support on 
Columns. For example, if the user wants to do:

df$column1 + 3
!df$colum2
grep(df$column, "regex")
df$column4 / df$column5

column1, column4, and column5 must be numeric/integer, column2 must be logical, 
and column3 must be character.

Now, what kind of operations are we planning to support on Array, Struct, and 
Map types? Depending on that we could map them to lists/environment or leave 
them as they are right now.

Hope this helps clarify, and let me know your thoughts.

Thanks!



> Method coltypes() to return the R column types of a DataFrame
> -
>
> Key: SPARK-10863
> URL: https://issues.apache.org/jira/browse/SPARK-10863
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Oscar D. Lara Yejas
>Assignee: Oscar D. Lara Yejas
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11732) MiMa excludes miss private classes

2015-11-13 Thread Tim Hunter (JIRA)
Tim Hunter created SPARK-11732:
--

 Summary: MiMa excludes miss private classes
 Key: SPARK-11732
 URL: https://issues.apache.org/jira/browse/SPARK-11732
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.5.1
Reporter: Tim Hunter
 Fix For: 1.6.0


The checks in GenerateMIMAIgnore only check for package private classes, not 
private classes.
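
To make the distinction concrete, a hedged illustration (not Spark source code) of the two visibility levels involved:

{code}
// Illustration only: the first class is package private and is picked up by
// GenerateMIMAIgnore today; the second is top-level private and is currently
// missed, which is what this issue addresses.
package org.apache.spark.example {
  private[spark] class PackagePrivateThing
  private class FullyPrivateThing
}
{code}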



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11732) MiMa excludes miss private classes

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11732:


Assignee: Apache Spark

> MiMa excludes miss private classes
> --
>
> Key: SPARK-11732
> URL: https://issues.apache.org/jira/browse/SPARK-11732
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.1
>Reporter: Tim Hunter
>Assignee: Apache Spark
>  Labels: newbie
> Fix For: 1.6.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The checks in GenerateMIMAIgnore only check for package private classes, not 
> private classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11454) DB2 dialect - map DB2 ROWID and TIMESTAMP with TIMEZONE types into valid Spark types

2015-11-13 Thread Suresh Thalamati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004488#comment-15004488
 ] 

Suresh Thalamati commented on SPARK-11454:
--

I am looking into fixing this JIRA along with the SPARK-10655 PR. 

> DB2 dialect - map DB2 ROWID and TIMESTAMP with TIMEZONE types into valid 
> Spark types
> 
>
> Key: SPARK-11454
> URL: https://issues.apache.org/jira/browse/SPARK-11454
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Pallavi Priyadarshini
>Priority: Minor
>
> Loading DB2 data types (ROWID and TIMESTAMP WITH TIMEZONE) into Spark 
> DataFrames fails. 
> The plan is to map them to Spark IntegerType and TimestampType, respectively.
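
A sketch of what such a mapping might look like through a custom JdbcDialect; this is purely illustrative, not taken from the eventual fix, and the DB2 type-name strings matched below are assumptions:

{code}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

// getCatalystType is the standard extension point for overriding JDBC type mapping.
case object DB2CustomDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:db2")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    if (typeName.equalsIgnoreCase("ROWID")) Some(IntegerType)                    // assumed type name
    else if (typeName.equalsIgnoreCase("TIMESTAMP WITH TIME ZONE")) Some(TimestampType) // assumed type name
    else None
  }
}

JdbcDialects.registerDialect(DB2CustomDialect)
{code}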



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11731) Enable batching on Driver WriteAheadLog by default

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004356#comment-15004356
 ] 

Apache Spark commented on SPARK-11731:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/9695

> Enable batching on Driver WriteAheadLog by default
> --
>
> Key: SPARK-11731
> URL: https://issues.apache.org/jira/browse/SPARK-11731
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Burak Yavuz
>
> Using batching on the driver for the WriteAheadLog should be an improvement 
> for all environments and use cases. Users will be able to scale to a much 
> higher number of receivers with the BatchedWriteAheadLog. Therefore we should 
> turn it on by default and QA it during the QA period.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11731) Enable batching on Driver WriteAheadLog by default

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11731:


Assignee: (was: Apache Spark)

> Enable batching on Driver WriteAheadLog by default
> --
>
> Key: SPARK-11731
> URL: https://issues.apache.org/jira/browse/SPARK-11731
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Burak Yavuz
>
> Using batching on the driver for the WriteAheadLog should be an improvement 
> for all environments and use cases. Users will be able to scale to a much 
> higher number of receivers with the BatchedWriteAheadLog. Therefore we should 
> turn it on by default and QA it during the QA period.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11731) Enable batching on Driver WriteAheadLog by default

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11731:


Assignee: Apache Spark

> Enable batching on Driver WriteAheadLog by default
> --
>
> Key: SPARK-11731
> URL: https://issues.apache.org/jira/browse/SPARK-11731
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> Using batching on the driver for the WriteAheadLog should be an improvement 
> for all environments and use cases. Users will be able to scale to a much 
> higher number of receivers with the BatchedWriteAheadLog. Therefore we should 
> turn it on by default and QA it during the QA period.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10863) Method coltypes() to return the R column types of a DataFrame

2015-11-13 Thread Oscar D. Lara Yejas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004363#comment-15004363
 ] 

Oscar D. Lara Yejas edited comment on SPARK-10863 at 11/13/15 5:58 PM:
---

[~felixcheung] Let me try to clarify a bit.

As suggested by [~shivaram], I implemented a fallback mechanism so that if 
there's no corresponding mapping from a Spark type into R's (i.e., mapping is 
NA), the same R type is returned.

The reason for this is that, in my opinion, having coltypes(df) return NA's 
would be a bit confusing from the user perspective. What would an NA type mean? 
Type not set or data inconsistency come to my mind if I were in the user's 
shoes.

I believe it all depends on the type of operations we want to support on 
Columns. For example, if the user wants to do:

df$column1 + 3
!df$colum2
grep(df$column3, "regex")
df$column4 / df$column5

column1, column4, and column5 must be numeric/integer, column2 must be logical, 
and column3 must be character.

Now, what kind of operations are we planning to support on Array, Struct, and 
Map types? Depending on that we could map them to lists/environment or I could 
fix it so that instead of returning map, for example, I could 
return map.

Hope this helps clarify, and let me know your thoughts.

Thanks!




was (Author: olarayej):
[~felixcheung] Let me try to clarify a bit.

As suggested by [~shivaram], I implemented a fallback mechanism so that if 
there's no corresponding mapping from a Spark type into R's (i.e., mapping is 
NA), the same R type is returned.

The reason for this is that, in my opinion, having coltypes(df) return NA's 
would be a bit confusing from the user perspective. What would an NA type mean? 
Type not set or data inconsistency come to my mind if I were in the user's 
shoes.

I believe it all depends on the type of operations we want to support on 
Columns. For example, if the user wants to do:

df$column1 + 3
!df$colum2
grep(df$column3, "regex")
df$column4 / df$column5

column1, column4, and column5 must be numeric/integer, column2 must be logical, 
and column3 must be character.

Now, what kind of operations are we planning to support on Array, Struct, and 
Map types? Depending on that we could map them to lists/environment or leave 
them as they are right now.

Hope this helps clarify, and let me know your thoughts.

Thanks!



> Method coltypes() to return the R column types of a DataFrame
> -
>
> Key: SPARK-10863
> URL: https://issues.apache.org/jira/browse/SPARK-10863
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Oscar D. Lara Yejas
>Assignee: Oscar D. Lara Yejas
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11731) Enable batching on Driver WriteAheadLog by default

2015-11-13 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-11731:
---

 Summary: Enable batching on Driver WriteAheadLog by default
 Key: SPARK-11731
 URL: https://issues.apache.org/jira/browse/SPARK-11731
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Burak Yavuz


Using batching on the driver for the WriteAheadLog should be an improvement for 
all environments and use cases. Users will be able to scale to a much higher 
number of receivers with the BatchedWriteAheadLog. Therefore we should turn it 
on by default and QA it during the QA period.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11730) Feature Importance for GBT

2015-11-13 Thread Brian Webb (JIRA)
Brian Webb created SPARK-11730:
--

 Summary: Feature Importance for GBT
 Key: SPARK-11730
 URL: https://issues.apache.org/jira/browse/SPARK-11730
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Brian Webb


Random Forests have feature importances, but GBTs do not. It would be great if 
we could add feature importances to GBTs as well. Perhaps the code in Random 
Forests can be refactored to apply to both types of ensembles.

See https://issues.apache.org/jira/browse/SPARK-5133
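
For context, a hedged sketch of what is available today on the random forest side (training is a placeholder DataFrame with "label"/"features" columns); nothing equivalent is exposed on the GBT models yet, which is what this issue asks for:

{code}
import org.apache.spark.ml.classification.RandomForestClassifier

// featureImportances was added to the ML random forest models under SPARK-5133.
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
val model = rf.fit(training)
println(model.featureImportances)   // per-feature importance vector
{code}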



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11733) Allow shuffle readers to request data from just one mapper

2015-11-13 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-11733:
-

 Summary: Allow shuffle readers to request data from just one mapper
 Key: SPARK-11733
 URL: https://issues.apache.org/jira/browse/SPARK-11733
 Project: Spark
  Issue Type: Sub-task
Reporter: Matei Zaharia


This is needed to do broadcast joins. Right now the shuffle reader interface 
takes a range of reduce IDs but fetches from all maps.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10408) Autoencoder

2015-11-13 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-10408:
-
Description: 
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, 
real in [0..1]. real in [-inf, +inf] 
2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to 
the MLP and then used here 
3)Denoising autoencoder 
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers


References: 
1. Vincent, Pascal, et al. "Extracting and composing robust features with 
denoising autoencoders." Proceedings of the 25th international conference on 
Machine learning. ACM, 2008. 
http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
 
2. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
(2010). Stacked denoising autoencoders: Learning useful representations in a 
deep network with a local denoising criterion. Journal of Machine Learning 
Research, 11(3371–3408). 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf
4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." 
Advances in neural information processing systems 19 (2007): 153. 
http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf

  was:
Goal: Implement various types of autoencoders 
Requirements:
1)Basic (deep) autoencoder that supports different types of inputs: binary, 
real in [0..1]. real in [-inf, +inf] 
2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature to 
the MLP and then used here 
3)Denoising autoencoder 
4)Stacked autoencoder for pre-training of deep networks. It should support 
arbitrary network layers


References: 
1, 2. 
http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
(2010). Stacked denoising autoencoders: Learning useful representations in a 
deep network with a local denoising criterion. Journal of Machine Learning 
Research, 11(3371–3408). 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf
4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." 
Advances in neural information processing systems 19 (2007): 153. 
http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf


> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1]. real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
>  
> 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11(3371–3408). 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11724) Casting integer types to timestamp has unexpected semantics

2015-11-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-11724:

Assignee: Nong Li

> Casting integer types to timestamp has unexpected semantics
> ---
>
> Key: SPARK-11724
> URL: https://issues.apache.org/jira/browse/SPARK-11724
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Nong Li
>Assignee: Nong Li
>Priority: Minor
>  Labels: releasenotes
>
> Casting from integer types to timestamp treats the source int as being in 
> millis. Casting from timestamp to integer types creates the result in 
> seconds. This leads to behavior like:
> {code}
> scala> sql("select cast(cast (1234 as timestamp) as bigint)").show
> +---+
> |_c0|
> +---+
> |  1|
> +---+
> {code}
> Doubles, on the other hand, treat it as seconds when casting to and from:
> {code}
> scala> sql("select cast(cast (1234.5 as timestamp) as double)").show
> +--+
> |   _c0|
> +--+
> |1234.5|
> +--+
> {code}
> This also breaks some other functions which return long in seconds, in 
> particular, unix_timestamp.
> {code}
> scala> sql("select cast(unix_timestamp() as timestamp)").show
> +--------------------+
> |                 _c0|
> +--------------------+
> |1970-01-17 10:03:...|
> +--------------------+
> scala> sql("select cast(unix_timestamp() *1000 as timestamp)").show
> +--------------------+
> |                 _c0|
> +--------------------+
> |2015-11-12 23:26:...|
> +--------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2344:
---

Assignee: (was: Apache Spark)

> Add Fuzzy C-Means algorithm to MLlib
> 
>
> Key: SPARK-2344
> URL: https://issues.apache.org/jira/browse/SPARK-2344
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Alex
>Priority: Minor
>  Labels: clustering
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib.
> FCM is very similar to K-Means, which is already implemented, and they 
> differ only in the degree of relationship each point has with each cluster 
> (in FCM the relationship is in the range [0..1], whereas in K-Means it is 0/1).
> As part of the implementation I would like to:
> - create a base class for K-Means and FCM
> - implement the relationship for each algorithm differently (in its class)
> I'd like this to be assigned to me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004589#comment-15004589
 ] 

Apache Spark commented on SPARK-2344:
-

User 'acflorea' has created a pull request for this issue:
https://github.com/apache/spark/pull/9699

> Add Fuzzy C-Means algorithm to MLlib
> 
>
> Key: SPARK-2344
> URL: https://issues.apache.org/jira/browse/SPARK-2344
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Alex
>Priority: Minor
>  Labels: clustering
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib.
> FCM is very similar to K-Means, which is already implemented, and they 
> differ only in the degree of relationship each point has with each cluster 
> (in FCM the relationship is in the range [0..1], whereas in K-Means it is 0/1).
> As part of the implementation I would like to:
> - create a base class for K-Means and FCM
> - implement the relationship for each algorithm differently (in its class)
> I'd like this to be assigned to me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10863) Method coltypes() to return the R column types of a DataFrame

2015-11-13 Thread Oscar D. Lara Yejas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004363#comment-15004363
 ] 

Oscar D. Lara Yejas edited comment on SPARK-10863 at 11/13/15 5:54 PM:
---

[~felixcheung] Let me try to clarify a bit.

As suggested by [~shivaram], I implemented a fallback mechanism so that if 
there's no corresponding mapping from a Spark type into R's (i.e., mapping is 
NA), the same R type is returned.

The reason for this is that, in my opinion, having coltypes(df) return NA's 
would be a bit confusing from the user perspective. What would an NA type mean? 
Type not set or data inconsistency come to my mind if I were in the user's 
shoes.

I believe it all depends on the type of operations we want to support on 
Columns. For example, if the user wants to do:

df$column1 + 3
!df$colum2
grep(df$column3, "regex")
df$column4 / df$column5

column1, column4, and column5 must be numeric/integer, column2 must be logical, 
and column3 must be character.

Now, what kind of operations are we planning to support on Array, Struct, and 
Map types? Depending on that we could map them to lists/environment or leave 
them as they are right now.

Hope this helps clarify, and let me know your thoughts.

Thanks!




was (Author: olarayej):
[~felixcheung] Let me try to clarify a bit.

As suggested by [~shivaram], I implemented a fallback mechanism so that if 
there's no corresponding mapping from a Spark type into R's (i.e., mapping is 
NA), the same R type is returned.

The reason for this is that, in my opinion, having coltypes(df) return NA's 
would be a bit confusing from the user perspective. What would an NA type mean? 
Type not set or data inconsistency come to my mind if I were in the user's 
shoes.

I believe it all depends on the type of operations we want to support on 
Columns. For example, if the user wants to do:

df$column1 + 3
!df$colum2
grep(df$column, "regex")
df$column4 / df$column5

column1, column4, and column5 must be numeric/integer, column2 must be logical, 
and column3 must be character.

Now, what kind of operations are we planning to support on Array, Struct, and 
Map types? Depending on that we could map them to lists/environment or leave 
them as they are right now.

Hope this helps clarify, and let me know your thoughts.

Thanks!



> Method coltypes() to return the R column types of a DataFrame
> -
>
> Key: SPARK-10863
> URL: https://issues.apache.org/jira/browse/SPARK-10863
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Oscar D. Lara Yejas
>Assignee: Oscar D. Lara Yejas
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11732) MiMa excludes miss private classes

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004431#comment-15004431
 ] 

Apache Spark commented on SPARK-11732:
--

User 'thunterdb' has created a pull request for this issue:
https://github.com/apache/spark/pull/9697

> MiMa excludes miss private classes
> --
>
> Key: SPARK-11732
> URL: https://issues.apache.org/jira/browse/SPARK-11732
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.1
>Reporter: Tim Hunter
>  Labels: newbie
> Fix For: 1.6.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The checks in GenerateMIMAIgnore only check for package private classes, not 
> private classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11732) MiMa excludes miss private classes

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11732:


Assignee: (was: Apache Spark)

> MiMa excludes miss private classes
> --
>
> Key: SPARK-11732
> URL: https://issues.apache.org/jira/browse/SPARK-11732
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.1
>Reporter: Tim Hunter
>  Labels: newbie
> Fix For: 1.6.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The checks in GenerateMIMAIgnore only check for package private classes, not 
> private classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11690) Add pivot to python api

2015-11-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11690:
-
Assignee: Andrew Ray

> Add pivot to python api
> ---
>
> Key: SPARK-11690
> URL: https://issues.apache.org/jira/browse/SPARK-11690
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Andrew Ray
>Assignee: Andrew Ray
>Priority: Minor
> Fix For: 1.6.0
>
>
> Add pivot method to the python api GroupedData class



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11690) Add pivot to python api

2015-11-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-11690.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9653
[https://github.com/apache/spark/pull/9653]

> Add pivot to python api
> ---
>
> Key: SPARK-11690
> URL: https://issues.apache.org/jira/browse/SPARK-11690
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Andrew Ray
>Priority: Minor
> Fix For: 1.6.0
>
>
> Add pivot method to the python api GroupedData class



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11734) Move reference sort into test and standardize on TungstenSort

2015-11-13 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11734:
---

 Summary: Move reference sort into test and standardize on 
TungstenSort
 Key: SPARK-11734
 URL: https://issues.apache.org/jira/browse/SPARK-11734
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9762) ALTER TABLE cannot find column

2015-11-13 Thread Shaun A Elliott (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004359#comment-15004359
 ] 

Shaun A Elliott commented on SPARK-9762:


Is there a workaround for this at all?

> ALTER TABLE cannot find column
> --
>
> Key: SPARK-9762
> URL: https://issues.apache.org/jira/browse/SPARK-9762
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>
> {{ALTER TABLE tbl CHANGE}} cannot find a column that {{DESCRIBE COLUMN}} 
> lists. 
> In the case of a table generated with {{HiveContext.read.json()}}, the output 
> of {{DESCRIBE dimension_components}} is:
> {code}
> comp_config   
> struct
> comp_criteria string
> comp_data_model   string
> comp_dimensions   
> struct
> comp_disabled boolean
> comp_id   bigint
> comp_path string
> comp_placementDatastruct
> comp_slot_types   array
> {code}
> However, {{alter table dimension_components change comp_dimensions 
> comp_dimensions 
> struct;}}
>  fails with:
> {code}
> 15/08/08 23:13:07 ERROR exec.DDLTask: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Invalid column reference 
> comp_dimensions
>   at org.apache.hadoop.hive.ql.exec.DDLTask.alterTable(DDLTask.java:3584)
>   at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:312)
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
>   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
>   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:345)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:326)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:155)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:326)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:316)
>   at 
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:473)
> ...
> {code}
> Meanwhile, {{SHOW COLUMNS in dimension_components}} lists two columns: 
> {{col}} (which does not exist in the table) and {{z}}, which was just added.
> This suggests that DDL operations in Spark SQL use table metadata 
> inconsistently.
> Full spark-sql output 
> [here|https://gist.github.com/ssimeonov/636a25d6074a03aafa67].
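
For anyone trying to reproduce this from PySpark, here is a minimal sketch of the reported scenario (the JSON path is hypothetical, the target type in the ALTER is simplified, and the original report used the spark-sql CLI):
{code}
from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)  # assumes an existing SparkContext sc

# Build a Hive table from JSON, as in the report (path is hypothetical)
hiveContext.read.json("/tmp/dimension_components.json") \
    .write.saveAsTable("dimension_components")

hiveContext.sql("DESCRIBE dimension_components").show()       # lists comp_dimensions
hiveContext.sql("SHOW COLUMNS IN dimension_components").show()

# Reportedly fails with 'Invalid column reference comp_dimensions'
hiveContext.sql(
    "ALTER TABLE dimension_components CHANGE comp_dimensions comp_dimensions string")
{code}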



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10863) Method coltypes() to return the R column types of a DataFrame

2015-11-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004442#comment-15004442
 ] 

Felix Cheung commented on SPARK-10863:
--

I understand, but I think this is problematic in several ways:
1. As I've stated, it creates inconsistency, with some columns reported as R types and 
some as Scala/JVM type names.
2. I think it is confusing to call things "map" since that type doesn't exist in R.
3. It breaks reversibility with coltypes<-:
{code}
coltypes(df) <- coltypes(df)  # this doesn't work if "map" is returned, and I 
don't think we should expect users to know map<>
{code}

This is why I originally suggested adding a show.atomic.type parameter to 
coltypes().


> Method coltypes() to return the R column types of a DataFrame
> -
>
> Key: SPARK-10863
> URL: https://issues.apache.org/jira/browse/SPARK-10863
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Oscar D. Lara Yejas
>Assignee: Oscar D. Lara Yejas
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11643) inserting date with leading zero inserts null example '0001-12-10'

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11643:


Assignee: Apache Spark  (was: Davies Liu)

> inserting date with leading zero inserts null example '0001-12-10'
> --
>
> Key: SPARK-11643
> URL: https://issues.apache.org/jira/browse/SPARK-11643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Chip Sands
>Assignee: Apache Spark
>
> Inserting a date with leading zeros inserts a null value, for example '0001-12-10'.
> This worked until 1.5/1.5.1.
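
A quick way to check the behavior from a PySpark shell (illustrative only; the original report is about inserts, a cast is used here just as a shorthand check, and sqlContext is assumed to exist as in the shell):
{code}
# In 1.5/1.5.1 this reportedly returns null instead of the date 0001-12-10.
sqlContext.sql("SELECT CAST('0001-12-10' AS DATE) AS d").show()
{code}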



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11643) inserting date with leading zero inserts null example '0001-12-10'

2015-11-13 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-11643:
--

Assignee: Davies Liu

> inserting date with leading zero inserts null example '0001-12-10'
> --
>
> Key: SPARK-11643
> URL: https://issues.apache.org/jira/browse/SPARK-11643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Chip Sands
>Assignee: Davies Liu
>
> Inserting a date with leading zeros inserts a null value, for example '0001-12-10'.
> This worked until 1.5/1.5.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11643) inserting date with leading zero inserts null example '0001-12-10'

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004622#comment-15004622
 ] 

Apache Spark commented on SPARK-11643:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/9701

> inserting date with leading zero inserts null example '0001-12-10'
> --
>
> Key: SPARK-11643
> URL: https://issues.apache.org/jira/browse/SPARK-11643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Chip Sands
>Assignee: Davies Liu
>
> Inserting a date with leading zeros inserts a null value, for example '0001-12-10'.
> This worked until 1.5/1.5.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8029) ShuffleMapTasks must be robust to concurrent attempts on the same executor

2015-11-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8029:
-
Target Version/s: 1.5.3, 1.6.0  (was: 1.5.2, 1.6.0)

> ShuffleMapTasks must be robust to concurrent attempts on the same executor
> --
>
> Key: SPARK-8029
> URL: https://issues.apache.org/jira/browse/SPARK-8029
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Imran Rashid
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.3, 1.6.0
>
> Attachments: 
> AlternativesforMakingShuffleMapTasksRobusttoMultipleAttempts.pdf
>
>
> When stages get retried, a task may have more than one attempt running at the 
> same time, on the same executor.  Currently this causes problems for 
> ShuffleMapTasks, since all attempts try to write to the same output files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11720:


Assignee: (was: Apache Spark)

> Return Double.NaN instead of null for Mean and Average when count = 0
> -
>
> Key: SPARK-11720
> URL: https://issues.apache.org/jira/browse/SPARK-11720
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jihong MA
>Priority: Minor
>
> Change the default behavior of mean in the case of count = 0 from null to 
> Double.NaN, to make it consistent with all other univariate stats functions.
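
A small sketch of the behavior in question (assumes a PySpark shell where sqlContext is defined):
{code}
from pyspark.sql import functions as F

empty = sqlContext.range(10).filter("id < 0")   # zero rows, so count = 0

# Today this prints null; the proposal is to return Double.NaN instead,
# matching the other univariate statistics.
empty.agg(F.avg("id")).show()
{code}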



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004791#comment-15004791
 ] 

Apache Spark commented on SPARK-11720:
--

User 'JihongMA' has created a pull request for this issue:
https://github.com/apache/spark/pull/9705

> Return Double.NaN instead of null for Mean and Average when count = 0
> -
>
> Key: SPARK-11720
> URL: https://issues.apache.org/jira/browse/SPARK-11720
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jihong MA
>Priority: Minor
>
> Change the default behavior of mean in the case of count = 0 from null to 
> Double.NaN, to make it consistent with all other univariate stats functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7308) Should there be multiple concurrent attempts for one stage?

2015-11-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-7308:
-
Assignee: Davies Liu

> Should there be multiple concurrent attempts for one stage?
> ---
>
> Key: SPARK-7308
> URL: https://issues.apache.org/jira/browse/SPARK-7308
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1
>Reporter: Imran Rashid
>Assignee: Davies Liu
> Fix For: 1.5.3, 1.6.0
>
> Attachments: SPARK-7308_discussion.pdf
>
>
> Currently, when there is a fetch failure, you can end up with multiple 
> concurrent attempts for the same stage.  Is this intended?  At best, it leads 
> to some very confusing behavior, and it makes it hard for the user to make 
> sense of what is going on.  At worst, I think this is the cause of some very 
> strange errors we've seen from users, where stages start 
> executing before all the dependent stages have completed.
> This can happen in the following scenario:  there is a fetch failure in 
> attempt 0, so the stage is retried.  attempt 1 starts.  But, tasks from 
> attempt 0 are still running -- some of them can also hit fetch failures after 
> attempt 1 starts.  That will cause additional stage attempts to get fired up.
> There is an attempt to handle this already 
> https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105
> but that only checks whether the **stage** is running.  It really should 
> check whether that **attempt** is still running, but there isn't enough info 
> to do that.  
> I'll also post some info on how to reproduce this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11740) Fix DStream checkpointing logic to prevent failures during checkpoint recovery

2015-11-13 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-11740:


 Summary: Fix DStream checkpointing logic to prevent failures 
during checkpoint recovery
 Key: SPARK-11740
 URL: https://issues.apache.org/jira/browse/SPARK-11740
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Shixiong Zhu


We checkpoint both when generating a batch and when completing a batch. When the 
processing time of a batch is greater than the batch interval, the checkpoint for 
completing an old batch may run after the checkpoint of a newer batch. If this 
happens, the checkpoint of the old batch actually has the latest information, but we 
won't recover from it. Then we may see RDD checkpoint file missing 
exceptions during checkpoint recovery. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8029) ShuffleMapTasks must be robust to concurrent attempts on the same executor

2015-11-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8029:
-
Description: 
When stages get retried, a task may have more than one attempt running at the 
same time, on the same executor.  Currently this causes problems for 
ShuffleMapTasks, since all attempts try to write to the same output files.

This is finally resolved through https://github.com/apache/spark/pull/9610, 
which uses the first writer wins approach.

  was:
When stages get retried, a task may have more than one attempt running at the 
same time, on the same executor.  Currently this causes problems for 
ShuffleMapTasks, since all attempts try to write to the same output files.

This is resolved through 


> ShuffleMapTasks must be robust to concurrent attempts on the same executor
> --
>
> Key: SPARK-8029
> URL: https://issues.apache.org/jira/browse/SPARK-8029
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Imran Rashid
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.3, 1.6.0
>
> Attachments: 
> AlternativesforMakingShuffleMapTasksRobusttoMultipleAttempts.pdf
>
>
> When stages get retried, a task may have more than one attempt running at the 
> same time, on the same executor.  Currently this causes problems for 
> ShuffleMapTasks, since all attempts try to write to the same output files.
> This is finally resolved through https://github.com/apache/spark/pull/9610, 
> which uses the first writer wins approach.
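
The "first writer wins" idea can be sketched outside of Spark as an atomic publish of each attempt's output: every attempt writes to its own temporary file, and only the first successful move into the final location counts (purely illustrative, not the actual shuffle writer code):
{code}
import os
import tempfile

def publish_map_output(payload, final_path):
    # Each attempt writes to a private temp file in the same directory.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(final_path) or None)
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    try:
        # link() fails if final_path already exists, so the first attempt to
        # publish wins and later attempts quietly discard their output.
        os.link(tmp_path, final_path)
    except FileExistsError:
        pass
    finally:
        os.remove(tmp_path)
{code}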



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7829) SortShuffleWriter writes inconsistent data & index files on stage retry

2015-11-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-7829.
--
  Resolution: Fixed
Assignee: Davies Liu  (was: Imran Rashid)
   Fix Version/s: 1.6.0
  1.5.3
Target Version/s: 1.5.3, 1.6.0

> SortShuffleWriter writes inconsistent data & index files on stage retry
> ---
>
> Key: SPARK-7829
> URL: https://issues.apache.org/jira/browse/SPARK-7829
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.3.1
>Reporter: Imran Rashid
>Assignee: Davies Liu
> Fix For: 1.5.3, 1.6.0
>
>
> When a stage is retried, even if a shuffle map task was successful, it may 
> get retried in any case.  If it happens to get scheduled on the same 
> executor, the old data file is *appended*, while the index file still assumes 
> the data starts in position 0.  This leads to an apparently corrupt shuffle 
> map output, since when the data file is read, the index file points to the 
> wrong location.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8029) ShuffleMapTasks must be robust to concurrent attempts on the same executor

2015-11-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8029:
-
Description: 
When stages get retried, a task may have more than one attempt running at the 
same time, on the same executor.  Currently this causes problems for 
ShuffleMapTasks, since all attempts try to write to the same output files.

This is resolved through 

  was:When stages get retried, a task may have more than one attempt running at 
the same time, on the same executor.  Currently this causes problems for 
ShuffleMapTasks, since all attempts try to write to the same output files.


> ShuffleMapTasks must be robust to concurrent attempts on the same executor
> --
>
> Key: SPARK-8029
> URL: https://issues.apache.org/jira/browse/SPARK-8029
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Imran Rashid
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.3, 1.6.0
>
> Attachments: 
> AlternativesforMakingShuffleMapTasksRobusttoMultipleAttempts.pdf
>
>
> When stages get retried, a task may have more than one attempt running at the 
> same time, on the same executor.  Currently this causes problems for 
> ShuffleMapTasks, since all attempts try to write to the same output files.
> This is resolved through 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7829) SortShuffleWriter writes inconsistent data & index files on stage retry

2015-11-13 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004780#comment-15004780
 ] 

Andrew Or commented on SPARK-7829:
--

I believe this is now fixed due to https://github.com/apache/spark/pull/9610. 
Let me know if this is not the case.

> SortShuffleWriter writes inconsistent data & index files on stage retry
> ---
>
> Key: SPARK-7829
> URL: https://issues.apache.org/jira/browse/SPARK-7829
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.3.1
>Reporter: Imran Rashid
>Assignee: Imran Rashid
> Fix For: 1.5.3, 1.6.0
>
>
> When a stage is retried, even if a shuffle map task was successful, it may 
> get retried in any case.  If it happens to get scheduled on the same 
> executor, the old data file is *appended*, while the index file still assumes 
> the data starts in position 0.  This leads to an apparently corrupt shuffle 
> map output, since when the data file is read, the index file points to the 
> wrong location.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10712) JVM crashes with spark.sql.tungsten.enabled = true

2015-11-13 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004636#comment-15004636
 ] 

Davies Liu commented on SPARK-10712:


What does your small table look like? Does 1.5.2-RC2 still have this issue?

> JVM crashes with spark.sql.tungsten.enabled = true
> --
>
> Key: SPARK-10712
> URL: https://issues.apache.org/jira/browse/SPARK-10712
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
> Environment: 1 node - Linux, 64GB ram, 8 core
>Reporter: Mauro Pirrone
>Priority: Critical
>
> When turning on tungsten, I get the following error when executing a 
> query/job with a few joins. When tungsten is turned off, the error does not 
> appear. Also note that tungsten works for me in other cases.
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7ffadaf59200, pid=7598, tid=140710015645440
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_45-b14) (build 
> 1.8.0_45-b14)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.45-b02 mixed mode 
> linux-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x7eb200]
> #
> # Core dump written. Default location: //core or core.7598 (max size 100 
> kB). To ensure a full core dump, try "ulimit -c unlimited" before starting 
> Java again
> #
> # An error report file with more information is saved as:
> # //hs_err_pid7598.log
> Compiled method (nm)   44403 10436 n 0   sun.misc.Unsafe::copyMemory 
> (native)
>  total in heap  [0x7ffac6b49290,0x7ffac6b495f8] = 872
>  relocation [0x7ffac6b493b8,0x7ffac6b49400] = 72
>  main code  [0x7ffac6b49400,0x7ffac6b495f8] = 504
> Compiled method (nm)   44403 10436 n 0   sun.misc.Unsafe::copyMemory 
> (native)
>  total in heap  [0x7ffac6b49290,0x7ffac6b495f8] = 872
>  relocation [0x7ffac6b493b8,0x7ffac6b49400] = 72
>  main code  [0x7ffac6b49400,0x7ffac6b495f8] = 504
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> ---  T H R E A D  ---
> Current thread (0x7ff7902e7800):  JavaThread "broadcast-hash-join-1" 
> daemon [_thread_in_vm, id=16548, stack(0x7ff66bd98000,0x7ff66be99000)]
> siginfo: si_signo: 11 (SIGSEGV), si_code: 2 (SEGV_ACCERR), si_addr: 
> 0x00069f572b10
> Registers:
> RAX=0x00069f672b08, RBX=0x7ff7902e7800, RCX=0x000394132140, 
> RDX=0xfffe0004
> RSP=0x7ff66be97048, RBP=0x7ff66be970a0, RSI=0x000394032148, 
> RDI=0x00069f572b10
> R8 =0x7ff66be970d0, R9 =0x0028, R10=0x7ff79cc0e1e7, 
> R11=0x7ff79cc0e198
> R12=0x7ff66be970c0, R13=0x7ff66be970d0, R14=0x0028, 
> R15=0x30323048
> RIP=0x7ff7b0dae200, EFLAGS=0x00010282, CSGSFS=0xe033, 
> ERR=0x0004
>   TRAPNO=0x000e
> Top of Stack: (sp=0x7ff66be97048)
> 0x7ff66be97048:   7ff7b1042b1a 7ff7902e7800
> 0x7ff66be97058:   7ff7 7ff7902e7800
> 0x7ff66be97068:   7ff7902e7800 7ff7ad2846a0
> 0x7ff66be97078:   7ff7897048d8 
> 0x7ff66be97088:   7ff66be97110 7ff66be971f0
> 0x7ff66be97098:   7ff7902e7800 7ff66be970f0
> 0x7ff66be970a8:   7ff79cc0e261 0010
> 0x7ff66be970b8:   000390c04048 00066f24fac8
> 0x7ff66be970c8:   7ff7902e7800 000394032120
> 0x7ff66be970d8:   7ff7902e7800 7ff66f971af0
> 0x7ff66be970e8:   7ff7902e7800 7ff66be97198
> 0x7ff66be970f8:   7ff79c9d4c4d 7ff66a454b10
> 0x7ff66be97108:   7ff79c9d4c4d 0010
> 0x7ff66be97118:   7ff7902e5a90 0028
> 0x7ff66be97128:   7ff79c9d4760 000394032120
> 0x7ff66be97138:   30323048 7ff66be97160
> 0x7ff66be97148:   00066f24fac8 000390c04048
> 0x7ff66be97158:   7ff66be97158 7ff66f978eeb
> 0x7ff66be97168:   7ff66be971f0 7ff66f9791c8
> 0x7ff66be97178:   7ff668e90c60 7ff66f978f60
> 0x7ff66be97188:   7ff66be97110 7ff66be971b8
> 0x7ff66be97198:   7ff66be97238 7ff79c9d4c4d
> 0x7ff66be971a8:   0010 
> 0x7ff66be971b8:   38363130 38363130
> 0x7ff66be971c8:   0028 7ff66f973388
> 0x7ff66be971d8:   000394032120 30323048
> 0x7ff66be971e8:   000665823080 00066f24fac8
> 0x7ff66be971f8:   7ff66be971f8 7ff66f973357
> 0x7ff66be97208:   7ff66be97260 7ff66f976fe0
> 0x7ff66be97218:    7ff66f973388
> 0x7ff66be97228:   7ff66be971b8 

[jira] [Commented] (SPARK-11737) String may not be serialized correctly with Kryo

2015-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004734#comment-15004734
 ] 

Apache Spark commented on SPARK-11737:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/9704

> String may not be serialized correctly with Kryo
> 
>
> Key: SPARK-11737
> URL: https://issues.apache.org/jira/browse/SPARK-11737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> When run in cluster mode, the driver may have different memory (and configs) 
> than the executors; also, if Kryo is used, strings cannot be collected back to 
> the driver correctly:
> {code}
> >>> sqlContext.range(10).selectExpr("repeat(cast(id as string), 9)").show()
> ++
> |repeat(cast(id as string),9)|
> ++
> | 0|
> | 1|
> | 2|
> | 3|
> | 4|
> | 5|
> | 6|
> | 7|
> | 8|
> | 9|
> ++
> {code}
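
For reproduction context, Kryo is turned on through the standard serializer setting; a minimal, assumed standalone setup looks like this (in the report the same query runs with a cluster-mode driver):
{code}
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName("kryo-string-check")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# With Kryo enabled and a cluster-mode driver, the collected strings come
# back wrong, as shown in the output above.
sqlContext.range(10).selectExpr("repeat(cast(id as string), 9)").show()
{code}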



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11737) String may not be serialized correctly with Kryo

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11737:


Assignee: Apache Spark  (was: Davies Liu)

> String may not be serialized correctly with Kryo
> 
>
> Key: SPARK-11737
> URL: https://issues.apache.org/jira/browse/SPARK-11737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Davies Liu
>Assignee: Apache Spark
>Priority: Critical
>
> When run in cluster mode, the driver may have different memory (and configs) 
> than the executors; also, if Kryo is used, strings cannot be collected back to 
> the driver correctly:
> {code}
> >>> sqlContext.range(10).selectExpr("repeat(cast(id as string), 9)").show()
> ++
> |repeat(cast(id as string),9)|
> ++
> | 0|
> | 1|
> | 2|
> | 3|
> | 4|
> | 5|
> | 6|
> | 7|
> | 8|
> | 9|
> ++
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11336) Include path to the source file in generated example code

2015-11-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11336:
--
Summary: Include path to the source file in generated example code  (was: 
Include a link to the source file in generated example code)

> Include path to the source file in generated example code
> -
>
> Key: SPARK-11336
> URL: https://issues.apache.org/jira/browse/SPARK-11336
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
> Fix For: 1.6.0
>
>
> It would be nice to include a link to the example source file at the bottom 
> of each code example, so if users want to try them, they know where to find 
> them. The font size should be small and unobtrusive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11336) Include a link to the source file in generated example code

2015-11-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11336.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9320
[https://github.com/apache/spark/pull/9320]

> Include a link to the source file in generated example code
> ---
>
> Key: SPARK-11336
> URL: https://issues.apache.org/jira/browse/SPARK-11336
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
> Fix For: 1.6.0
>
>
> It would be nice to include a link to the example source file at the bottom 
> of each code example, so if users want to try them, they know where to find 
> them. The font size should be small and unobtrusive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11336) Include path to the source file in generated example code

2015-11-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11336:
--
Description: It would be nice to include -a link- the path to the example 
source file at the bottom of each code example. So if users want to try them, 
they know where to find. The font size should be small and not interrupting.  
(was: It would be nice to include a link to the example source file at the 
bottom of each code example. So if users want to try them, they know where to 
find. The font size should be small and not interrupting.)

> Include path to the source file in generated example code
> -
>
> Key: SPARK-11336
> URL: https://issues.apache.org/jira/browse/SPARK-11336
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
> Fix For: 1.6.0
>
>
> It would be nice to include -a link- the path to the example source file at 
> the bottom of each code example. So if users want to try them, they know 
> where to find. The font size should be small and not interrupting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11720:


Assignee: Apache Spark

> Return Double.NaN instead of null for Mean and Average when count = 0
> -
>
> Key: SPARK-11720
> URL: https://issues.apache.org/jira/browse/SPARK-11720
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jihong MA
>Assignee: Apache Spark
>Priority: Minor
>
> Change the default behavior of mean in the case of count = 0 from null to 
> Double.NaN, to make it consistent with all other univariate stats functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11735) Add a check in the constructor of SqlContext to make sure the SparkContext is not stopped

2015-11-13 Thread Yin Huai (JIRA)
Yin Huai created SPARK-11735:


 Summary: Add a check in the constructor of SqlContext to make sure 
the SparkContext is not stopped
 Key: SPARK-11735
 URL: https://issues.apache.org/jira/browse/SPARK-11735
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11736) Add MonotonicallyIncreasingID to function registry

2015-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11736:


Assignee: Apache Spark  (was: Yin Huai)

> Add MonotonicallyIncreasingID to function registry
> --
>
> Key: SPARK-11736
> URL: https://issues.apache.org/jira/browse/SPARK-11736
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>
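
There is no description yet, but the practical effect would be making the existing expression callable from SQL text as well as from the DataFrame API; a sketch of the difference (the SQL-facing function name below is an assumption, and sqlContext is assumed to exist as in the PySpark shell):
{code}
from pyspark.sql import functions as F

df = sqlContext.range(5)

# Already available through the DataFrame API:
df.select(F.monotonicallyIncreasingId().alias("uid")).show()

# What registering the expression would enable (name assumed):
df.registerTempTable("t")
sqlContext.sql("SELECT monotonically_increasing_id() AS uid FROM t").show()
{code}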




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8029) ShuffleMapTasks must be robust to concurrent attempts on the same executor

2015-11-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-8029:
-
Fix Version/s: (was: 1.5.2)
   1.5.3

> ShuffleMapTasks must be robust to concurrent attempts on the same executor
> --
>
> Key: SPARK-8029
> URL: https://issues.apache.org/jira/browse/SPARK-8029
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Imran Rashid
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.3, 1.6.0
>
> Attachments: 
> AlternativesforMakingShuffleMapTasksRobusttoMultipleAttempts.pdf
>
>
> When stages get retried, a task may have more than one attempt running at the 
> same time, on the same executor.  Currently this causes problems for 
> ShuffleMapTasks, since all attempts try to write to the same output files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


