[jira] [Comment Edited] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-08-22 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432186#comment-15432186
 ] 

Yun Ni edited comment on SPARK-5992 at 8/23/16 5:50 AM:


Hi,

We are engineers from Uber. Here is our design doc for LSH:
https://docs.google.com/document/d/1D15DTDMF_UWTTyWqXfG7y76iZalky4QmifUYQ6lH5GM/edit

Please take a look and let us know if this meets your requirements or not.

Thanks,
Yun Ni


was (Author: yunn):
Hi,

We are engineers from Uber. Here is our design doc for LSH:
https://docs.google.com/document/d/1D15DTDMF_UWTTyWqXfG7y76iZalky4QmifUYQ6lH5GM/edit

Please take a look and let us know if this meets your requirements or not.

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.
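
For discussion, here is a minimal sketch (not taken from the design doc above) of one possible algorithm, sign-random-projection LSH for cosine similarity, written against the spark.ml {{Vector}} type; the class and parameter names are illustrative only.

{code}
import scala.util.Random
import org.apache.spark.ml.linalg.Vector

// Illustrative only: random-hyperplane LSH for cosine similarity.
// Each hash bit is the sign of the dot product with a random Gaussian hyperplane.
class RandomHyperplaneLSH(dim: Int, numBits: Int, seed: Long = 42L) extends Serializable {
  private val rng = new Random(seed)
  private val hyperplanes: Array[Array[Double]] =
    Array.fill(numBits)(Array.fill(dim)(rng.nextGaussian()))

  def hash(v: Vector): Array[Int] =
    hyperplanes.map { plane =>
      var dot = 0.0
      v.foreachActive((i, x) => dot += x * plane(i))
      if (dot >= 0) 1 else 0
    }
}
{code}

Vectors whose signatures agree in many bits are likely to have high cosine similarity, which is the property an MLlib API would build its approximate nearest-neighbour queries on.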



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-08-22 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432186#comment-15432186
 ] 

Yun Ni commented on SPARK-5992:
---

Hi,

We are engineers from Uber. Here is our design doc for LSH:
https://docs.google.com/document/d/1D15DTDMF_UWTTyWqXfG7y76iZalky4QmifUYQ6lH5GM/edit

Please take a look and let us know if this meets your requirements or not.

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11638) Run Spark on Mesos with bridge networking

2016-08-22 Thread Liam Fisk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432163#comment-15432163
 ] 

Liam Fisk commented on SPARK-11638:
---

Just mirroring what I said on SPARK-4563 about the lack of support for bridged 
networking:

{quote}
It also makes life difficult for OSX users. Docker for Mac uses xhyve to 
virtualize the docker engine 
(https://docs.docker.com/engine/installation/mac/), and thus `--net=host` binds 
to the VM's network instead of the true OSX host. The SPARK_LOCAL_IP ends up as 
172.17.0.2, which is not externally contactable.

The end result is OSX users cannot containerize Spark if Spark needs to contact 
a Mesos cluster.
{quote}

While you are unlikely to have Spark running on an OSX machine in production, 
the development experience is a bit painful if you have to run a separate VM 
with public networking.

> Run Spark on Mesos with bridge networking
> -
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, 
> {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and 
> {{spark.replClassServer.advertisedPort}} settings to enable running Spark in 
> Mesos on Docker with Bridge networking. Provides patches for Akka Remote to 
> enable Spark driver advertisement using alternative host and port.
> With these settings, it is possible to run Spark Master in a Docker container 
> and have the executors running on Mesos talk back correctly to such Master.
> The problem is discussed on the Mesos mailing list here: 
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework to receive orders in the bridged container, Mesos 
> in the container has to register for offers using the IP address of the 
> Agent. Offers are sent by Mesos Master to the Docker container running on a 
> different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} 
> would advertise itself using the IP address of the container, something like 
> {{172.x.x.x}}. Obviously, Mesos Master can't reach that address, it's a 
> different host, it's a different machine. Mesos 0.24.0 introduced two new 
> properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and 
> {{LIBPROCESS_ADVERTISE_PORT}}. This allows the container to use the Agent's 
> address to register for offers. This was provided mainly for running Mesos in 
> Docker on Mesos.
> h4. Spark - how does the above relate and what is being addressed here?
> Similar to Mesos, out of the box, Spark does not allow advertising its 
> services on ports other than the bind ports. Consider the following scenario:
> Spark is running inside a Docker container on Mesos in bridge networking 
> mode. Assume port {{}} for {{spark.driver.port}}, {{6677}} for 
> {{spark.fileserver.port}}, {{6688}} for {{spark.broadcast.port}} and 
> {{23456}} for {{spark.replClassServer.port}}. If such a task is posted to 
> Marathon, Mesos will assign 4 ports in the {{31000-32000}} range mapped to the 
> container ports. Starting the executors from such a container results in the 
> executors not being able to communicate back to the Spark Master.
> This happens because of 2 things:
> Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} 
> transport. {{akka-remote}} prior to version {{2.4}} can't advertise a port 
> different from the one it is bound to. The settings discussed are here: 
> https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376.
>  These do not exist in Akka {{2.3.x}}. Spark driver will always advertise 
> port {{}} as this is the one {{akka-remote}} is bound to.
> Any URIs the executors use to contact the Spark Master are prepared by the Spark 
> Master and handed over to the executors. These always contain the port number 
> the Master uses for the service. The services are:
> - {{spark.broadcast.port}}
> - {{spark.fileserver.port}}
> - {{spark.replClassServer.port}}
> All of the above ports default to {{0}} (random assignment) but can be specified 
> via Spark configuration ( {{-Dspark...port}} ). However, they are limited 
> in the same way as {{spark.driver.port}}; in the above example, an 
> executor should not contact the file server on port {{6677}} but rather on 
> the respective 31xxx port assigned by Mesos.
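
For illustration, a sketch of how the settings proposed in this ticket might be used from the application side when the driver runs in a bridged container. The {{*.advertisedPort}} keys are the ones proposed here (they are not part of stock Spark), and the concrete port values are made up.

{code}
import org.apache.spark.SparkConf

// Sketch only: advertise the host ports that Marathon/Mesos mapped to the
// container ports, so executors outside the container can reach the driver.
val conf = new SparkConf()
  .set("spark.driver.port", "6666")                      // bound inside the container (made-up value)
  .set("spark.driver.advertisedPort", "31001")           // proposed setting: host port mapped by Mesos
  .set("spark.fileserver.port", "6677")
  .set("spark.fileserver.advertisedPort", "31002")       // proposed setting
  .set("spark.broadcast.port", "6688")
  .set("spark.broadcast.advertisedPort", "31003")        // proposed setting
  .set("spark.replClassServer.port", "23456")
  .set("spark.replClassServer.advertisedPort", "31004")  // proposed setting
{code}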

[jira] [Commented] (SPARK-11638) Run Spark on Mesos with bridge networking

2016-08-22 Thread Liam Fisk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432151#comment-15432151
 ] 

Liam Fisk commented on SPARK-11638:
---

This ticket appears to be related to SPARK-4563. 

> Run Spark on Mesos with bridge networking
> -
>
> Key: SPARK-11638
> URL: https://issues.apache.org/jira/browse/SPARK-11638
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: Radoslaw Gruchalski
> Attachments: 1.4.0.patch, 1.4.1.patch, 1.5.0.patch, 1.5.1.patch, 
> 1.5.2.patch, 1.6.0.patch, 2.3.11.patch, 2.3.4.patch
>
>
> h4. Summary
> Provides {{spark.driver.advertisedPort}}, 
> {{spark.fileserver.advertisedPort}}, {{spark.broadcast.advertisedPort}} and 
> {{spark.replClassServer.advertisedPort}} settings to enable running Spark in 
> Mesos on Docker with Bridge networking. Provides patches for Akka Remote to 
> enable Spark driver advertisement using alternative host and port.
> With these settings, it is possible to run Spark Master in a Docker container 
> and have the executors running on Mesos talk back correctly to such Master.
> The problem is discussed on the Mesos mailing list here: 
> https://mail-archives.apache.org/mod_mbox/mesos-user/201510.mbox/%3CCACTd3c9vjAMXk=bfotj5ljzfrh5u7ix-ghppfqknvg9mkkc...@mail.gmail.com%3E
> h4. Running Spark on Mesos - LIBPROCESS_ADVERTISE_IP opens the door
> In order for the framework to receive orders in the bridged container, Mesos 
> in the container has to register for offers using the IP address of the 
> Agent. Offers are sent by Mesos Master to the Docker container running on a 
> different host, an Agent. Normally, prior to Mesos 0.24.0, {{libprocess}} 
> would advertise itself using the IP address of the container, something like 
> {{172.x.x.x}}. Obviously, Mesos Master can't reach that address, it's a 
> different host, it's a different machine. Mesos 0.24.0 introduced two new 
> properties for {{libprocess}} - {{LIBPROCESS_ADVERTISE_IP}} and 
> {{LIBPROCESS_ADVERTISE_PORT}}. This allows the container to use the Agent's 
> address to register for offers. This was provided mainly for running Mesos in 
> Docker on Mesos.
> h4. Spark - how does the above relate and what is being addressed here?
> Similar to Mesos, out of the box, Spark does not allow advertising its 
> services on ports other than the bind ports. Consider the following scenario:
> Spark is running inside a Docker container on Mesos in bridge networking 
> mode. Assume port {{}} for {{spark.driver.port}}, {{6677}} for 
> {{spark.fileserver.port}}, {{6688}} for {{spark.broadcast.port}} and 
> {{23456}} for {{spark.replClassServer.port}}. If such a task is posted to 
> Marathon, Mesos will assign 4 ports in the {{31000-32000}} range mapped to the 
> container ports. Starting the executors from such a container results in the 
> executors not being able to communicate back to the Spark Master.
> This happens because of 2 things:
> Spark driver is effectively an {{akka-remote}} system with {{akka.tcp}} 
> transport. {{akka-remote}} prior to version {{2.4}} can't advertise a port 
> different from the one it is bound to. The settings discussed are here: 
> https://github.com/akka/akka/blob/f8c1671903923837f22d0726a955e0893add5e9f/akka-remote/src/main/resources/reference.conf#L345-L376.
>  These do not exist in Akka {{2.3.x}}. Spark driver will always advertise 
> port {{}} as this is the one {{akka-remote}} is bound to.
> Any URIs the executors use to contact the Spark Master are prepared by the Spark 
> Master and handed over to the executors. These always contain the port number 
> the Master uses for the service. The services are:
> - {{spark.broadcast.port}}
> - {{spark.fileserver.port}}
> - {{spark.replClassServer.port}}
> All of the above ports default to {{0}} (random assignment) but can be specified 
> via Spark configuration ( {{-Dspark...port}} ). However, they are limited 
> in the same way as {{spark.driver.port}}; in the above example, an 
> executor should not contact the file server on port {{6677}} but rather on 
> the respective 31xxx port assigned by Mesos.
> Spark currently does not allow any of that.
> h4. Taking on the problem, step 1: Spark Driver
> As mentioned above, Spark Driver is based on {{akka-remote}}. In order to 
> take on the problem, the {{akka.remote.net.tcp.bind-hostname}} and 
> {{akka.remote.net.tcp.bind-port}} settings are a must. Spark does not compile 
> with Akka 2.4.x yet.
> What we want is a backport of the mentioned {{akka-remote}} settings to the 
> {{2.3.x}} versions. These patches are attached to this ticket - the 
> {{2.3.4.patch}} and {{2.3.11.patch}} files provide patches for the respective 
> Akka versions. These add the mentioned settings and ensure they 

[jira] [Commented] (SPARK-4563) Allow spark driver to bind to a different ip than the advertised ip

2016-08-22 Thread Liam Fisk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432149#comment-15432149
 ] 

Liam Fisk commented on SPARK-4563:
--

This ticket appears to be related to SPARK-11638

> Allow spark driver to bind to a different ip than the advertised ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Priority: Minor
>
> The Spark driver bind IP and advertised IP are not separately configurable. 
> spark.driver.host is only the bind IP, and SPARK_PUBLIC_DNS does not work for 
> the Spark driver. Allow an option to set the advertised IP/hostname.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17195) Dealing with JDBC column nullability when it is not reliable

2016-08-22 Thread Jason Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432129#comment-15432129
 ] 

Jason Moore commented on SPARK-17195:
-

> I think it might be sensible to support this by implementing 
> `SchemaRelationProvider`.

That was one of the thoughts I had too.

> it should be fixed in Teradata

I more or less agree with this, but I'm not certain their JDBC driver is being 
provided with everything they need from the server to decide that it should be 
"columnNullableUnknown".  Maybe I'll shoot some more questions their way on 
this.
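
As a sketch of the application-side workaround being discussed (assuming a SparkSession named {{spark}} and made-up connection details), one can re-apply the JDBC-reported schema with every field forced to nullable before operating on the data:

{code}
import org.apache.spark.sql.types.StructType

// Sketch only: force every column to nullable = true so codegen does not rely on
// non-null guarantees that the (unreliable) JDBC metadata made.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:teradata://example-host/DATABASE=mydb") // made-up connection details
  .option("dbtable", "(select * from some_table) t")
  .load()

val nullableSchema = StructType(jdbcDF.schema.map(_.copy(nullable = true)))
val safeDF = spark.createDataFrame(jdbcDF.rdd, nullableSchema)
{code}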

> Dealing with JDBC column nullability when it is not reliable
> 
>
> Key: SPARK-17195
> URL: https://issues.apache.org/jira/browse/SPARK-17195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jason Moore
>
> Starting with Spark 2.0.0, the column "nullable" property is important to 
> have correct for the code generation to work properly.  Marking the column as 
> nullable = false used to (<2.0.0) allow null values to be operated on, but 
> now this will result in:
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> {noformat}
> I'm all for the change towards a more rigid behavior (enforcing correct 
> input).  But the problem I'm facing now is that when I used JDBC to read from 
> a Teradata server, the column nullability is often not correct (particularly 
> when sub-queries are involved).
> This is the line in question:
> https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L140
> I'm trying to work out what would be the way forward for me on this.  I know 
> that it's really the fault of the Teradata database server not returning the 
> correct schema, but I'll need to make Spark itself or my application 
> resilient to this behavior.
> One of the Teradata JDBC Driver tech leads has told me that "when the 
> rsmd.getSchemaName and rsmd.getTableName methods return an empty zero-length 
> string, then the other metadata values may not be completely accurate" - so 
> one option could be to treat the nullability (at least) the same way as the 
> "unknown" case (as nullable = true).  For reference, see the rest of our 
> discussion here: 
> http://forums.teradata.com/forum/connectivity/teradata-jdbc-driver-returns-the-wrong-schema-column-nullability
> Any other thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4563) Allow spark driver to bind to a different ip than the advertised ip

2016-08-22 Thread Liam Fisk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432128#comment-15432128
 ] 

Liam Fisk commented on SPARK-4563:
--

It also makes life difficult for OSX users. Docker for Mac uses xhyve to 
virtualize the docker engine 
(https://docs.docker.com/engine/installation/mac/), and thus `--net=host` binds 
to the VM's network instead of the true OSX host. The SPARK_LOCAL_IP ends up as 
172.17.0.2, which is not externally contactable. 

The end result is OSX users cannot containerize Spark if Spark needs to contact 
a Mesos cluster.

> Allow spark driver to bind to a different ip than the advertised ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Priority: Minor
>
> The Spark driver bind IP and advertised IP are not separately configurable. 
> spark.driver.host is only the bind IP, and SPARK_PUBLIC_DNS does not work for 
> the Spark driver. Allow an option to set the advertised IP/hostname.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4563) Allow spark driver to bind to a different ip than the advertised ip

2016-08-22 Thread Reece Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432109#comment-15432109
 ] 

Reece Robinson commented on SPARK-4563:
---

+1. I also disagree with the minor rating. This is essential to our production 
strategy of containerising our workloads.

> Allow spark driver to bind to a different ip than the advertised ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Priority: Minor
>
> The Spark driver bind IP and advertised IP are not separately configurable. 
> spark.driver.host is only the bind IP, and SPARK_PUBLIC_DNS does not work for 
> the Spark driver. Allow an option to set the advertised IP/hostname.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17195) Dealing with JDBC column nullability when it is not reliable

2016-08-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432103#comment-15432103
 ] 

Hyukjin Kwon commented on SPARK-17195:
--

I think if the Teradata JDBC driver gives an incorrect schema (about nullability), it 
should be fixed in Teradata. I mean.. JDBC is a protocol, and we might not have 
to consider cases where the implementation does not comply with the protocol.
I see there are three nullability cases in the metadata: 
`ResultSetMetaData.columnNullableUnknown`, `ResultSetMetaData.columnNullable` 
and `ResultSetMetaData.columnNoNulls`. If the nullability is "unknown", then the 
driver should report `ResultSetMetaData.columnNullableUnknown`.

Another thought is to make the JDBC data source accept a user-defined schema. If 
the "inferred" schema is not reliable in many cases, I think it might be 
sensible to support this by implementing `SchemaRelationProvider`.
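
A minimal sketch of the mapping being discussed (mirroring, not quoting, the logic around the JDBCRDD line linked in the description), in which the "unknown" case is treated the same as nullable:

{code}
import java.sql.ResultSetMetaData

// Sketch only: treat columnNullable and columnNullableUnknown alike, so only an
// explicit columnNoNulls produces a non-nullable field in the inferred schema.
def fieldIsNullable(metadata: ResultSetMetaData, column: Int): Boolean =
  metadata.isNullable(column) != ResultSetMetaData.columnNoNulls
{code}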


> Dealing with JDBC column nullability when it is not reliable
> 
>
> Key: SPARK-17195
> URL: https://issues.apache.org/jira/browse/SPARK-17195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jason Moore
>
> Starting with Spark 2.0.0, the column "nullable" property is important to 
> have correct for the code generation to work properly.  Marking the column as 
> nullable = false used to (<2.0.0) allow null values to be operated on, but 
> now this will result in:
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> {noformat}
> I'm all for the change towards a more rigid behavior (enforcing correct 
> input).  But the problem I'm facing now is that when I used JDBC to read from 
> a Teradata server, the column nullability is often not correct (particularly 
> when sub-queries are involved).
> This is the line in question:
> https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L140
> I'm trying to work out what would be the way forward for me on this.  I know 
> that it's really the fault of the Teradata database server not returning the 
> correct schema, but I'll need to make Spark itself or my application 
> resilient to this behavior.
> One of the Teradata JDBC Driver tech leads has told me that "when the 
> rsmd.getSchemaName and rsmd.getTableName methods return an empty zero-length 
> string, then the other metadata values may not be completely accurate" - so 
> one option could be to treat the nullability (at least) the same way as the 
> "unknown" case (as nullable = true).  For reference, see the rest of our 
> discussion here: 
> http://forums.teradata.com/forum/connectivity/teradata-jdbc-driver-returns-the-wrong-schema-column-nullability
> Any other thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4563) Allow spark driver to bind to a different ip than the advertised ip

2016-08-22 Thread Liam Fisk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432089#comment-15432089
 ] 

Liam Fisk commented on SPARK-4563:
--

Further to my comment, I disagree with the "minor" rating of this issue. 
Without this feature, Spark cannot be containerized in a production 
environment, as --net=host is not an option when multiple containers exist.

> Allow spark driver to bind to a different ip than the advertised ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Priority: Minor
>
> The Spark driver bind IP and advertised IP are not separately configurable. 
> spark.driver.host is only the bind IP, and SPARK_PUBLIC_DNS does not work for 
> the Spark driver. Allow an option to set the advertised IP/hostname.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4563) Allow spark driver to bind to a different ip than the advertised ip

2016-08-22 Thread Liam Fisk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432086#comment-15432086
 ] 

Liam Fisk commented on SPARK-4563:
--

+1 for me

I am running into this problem when I run `spark-submit --master 
mesos://zk://foo:2181/mesos ` in a Docker container.

As Spark is in the container, SPARK_LOCAL_IP will resolve to 172.17.0.3 or 
similar, and the Mesos executors will fail to contact this address. If 
SPARK_ADVERTISED_IP existed, I would broadcast the IP of the host system.

I cannot use host networking (as this container will inhabit multi-tenanted 
infrastructure).
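
For illustration, a sketch of the bind/advertise split being asked for, expressed as driver configuration. {{spark.driver.advertisedHost}} is a hypothetical key used only to show the intent (it does not exist in the Spark versions discussed here), and the addresses are made up.

{code}
import org.apache.spark.SparkConf

// Sketch only: bind inside the container, but advertise the host machine's
// address to the Mesos executors. The "advertisedHost" key is hypothetical.
val conf = new SparkConf()
  .setMaster("mesos://zk://foo:2181/mesos")
  .set("spark.driver.host", "172.17.0.3")          // what the driver binds to inside the container today
  .set("spark.driver.advertisedHost", "10.0.0.5")  // hypothetical: what SPARK_ADVERTISED_IP would set
{code}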

> Allow spark driver to bind to a different ip than the advertised ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Priority: Minor
>
> The Spark driver bind IP and advertised IP are not separately configurable. 
> spark.driver.host is only the bind IP, and SPARK_PUBLIC_DNS does not work for 
> the Spark driver. Allow an option to set the advertised IP/hostname.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17198) ORC fixed char literal filter does not work

2016-08-22 Thread tuming (JIRA)
tuming created SPARK-17198:
--

 Summary: ORC fixed char literal filter does not work
 Key: SPARK-17198
 URL: https://issues.apache.org/jira/browse/SPARK-17198
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: tuming


I get a wrong result when I run the following query in Spark SQL:
select * from orc_table where char_col = '5LZS';

Table orc_table is an ORC-format table.
Column char_col is defined as char(6).

The Hive record reader returns a char(6) string to Spark, and Spark has no 
fixed-length char type: all fixed char attributes are converted to String by 
default. Meanwhile, the constant literal is parsed to a string Literal, so the 
equality comparison never returns true. For instance: 
'5LZS' == '5LZS  '.

But I get the correct result in Hive with the same data and SQL string, because 
Hive pads those constant literals with spaces. Please refer to:
https://issues.apache.org/jira/browse/HIVE-11312

I found there is no such patch for Spark.
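
Possible query-level workarounds (a sketch, assuming a HiveContext/SQLContext named {{sqlContext}}): strip the padding before comparing, or pad the literal to the declared width.

{code}
// Sketch only: work around the missing CHAR(6) padding semantics in the comparison.
sqlContext.sql("select * from orc_table where rtrim(char_col) = '5LZS'").show()
sqlContext.sql("select * from orc_table where char_col = rpad('5LZS', 6, ' ')").show()
{code}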
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17197) PySpark LiR/LoR supports tree aggregation level configurable

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17197:


Assignee: Apache Spark

> PySpark LiR/LoR supports tree aggregation level configurable
> 
>
> Key: SPARK-17197
> URL: https://issues.apache.org/jira/browse/SPARK-17197
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-17090 made the tree aggregation level in LiR/LoR configurable; this JIRA is 
> to make PySpark support this as well.
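
For reference, a sketch of the Scala-side knob this would mirror, assuming the {{aggregationDepth}} expert param introduced by SPARK-17090; the PySpark API would expose the equivalent setter.

{code}
import org.apache.spark.ml.classification.LogisticRegression

// Sketch only: raise the treeAggregate depth for very wide or very large datasets.
val lor = new LogisticRegression()
  .setMaxIter(100)
  .setAggregationDepth(3)  // default is 2; assumes the param added by SPARK-17090
{code}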



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17197) PySpark LiR/LoR supports tree aggregation level configurable

2016-08-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432072#comment-15432072
 ] 

Apache Spark commented on SPARK-17197:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/14766

> PySpark LiR/LoR supports tree aggregation level configurable
> 
>
> Key: SPARK-17197
> URL: https://issues.apache.org/jira/browse/SPARK-17197
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> SPARK-17090 made the tree aggregation level in LiR/LoR configurable; this JIRA is 
> to make PySpark support this as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17197) PySpark LiR/LoR supports tree aggregation level configurable

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17197:


Assignee: (was: Apache Spark)

> PySpark LiR/LoR supports tree aggregation level configurable
> 
>
> Key: SPARK-17197
> URL: https://issues.apache.org/jira/browse/SPARK-17197
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> SPARK-17090 made the tree aggregation level in LiR/LoR configurable; this JIRA is 
> to make PySpark support this as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15815) Hang while enable blacklistExecutor and DynamicExecutorAllocator

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15815:


Assignee: Apache Spark

> Hang while enable blacklistExecutor and DynamicExecutorAllocator 
> -
>
> Key: SPARK-15815
> URL: https://issues.apache.org/jira/browse/SPARK-15815
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.6.1
>Reporter: SuYan
>Assignee: Apache Spark
>Priority: Minor
>
> Enable the executor blacklist with a timeout larger than 120s and enable 
> dynamic allocation with minExecutors = 0.
> 1. Assume there is only 1 task left, running in Executor A, and all other 
> executors have timed out.
> 2. The task fails, so it will not be scheduled on the current Executor A again 
> because of the blacklist timeout.
> 3. ExecutorAllocationManager keeps requesting targetNumExecutors = 1; since we 
> already have Executor A, oldTargetNumExecutors == targetNumExecutors = 1, so no 
> more executors are ever added... even after Executor A times out. It ends up 
> endlessly requesting delta=0 executors.
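
For reproduction purposes, a sketch of the configuration combination being described; treat the exact key names (particularly the 1.6-era blacklist timeout) as an assumption.

{code}
import org.apache.spark.SparkConf

// Sketch only: a task blacklist timeout larger than 120s combined with dynamic
// allocation and minExecutors = 0, the combination reported to hang.
val conf = new SparkConf()
  .set("spark.scheduler.executorTaskBlacklistTime", "180000") // ms; assumed legacy 1.6-era key
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "0")
{code}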



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15815) Hang while enable blacklistExecutor and DynamicExecutorAllocator

2016-08-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432068#comment-15432068
 ] 

Apache Spark commented on SPARK-15815:
--

User 'suyanNone' has created a pull request for this issue:
https://github.com/apache/spark/pull/14765

> Hang while enable blacklistExecutor and DynamicExecutorAllocator 
> -
>
> Key: SPARK-15815
> URL: https://issues.apache.org/jira/browse/SPARK-15815
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.6.1
>Reporter: SuYan
>Priority: Minor
>
> Enable the executor blacklist with a timeout larger than 120s and enable 
> dynamic allocation with minExecutors = 0.
> 1. Assume there is only 1 task left, running in Executor A, and all other 
> executors have timed out.
> 2. The task fails, so it will not be scheduled on the current Executor A again 
> because of the blacklist timeout.
> 3. ExecutorAllocationManager keeps requesting targetNumExecutors = 1; since we 
> already have Executor A, oldTargetNumExecutors == targetNumExecutors = 1, so no 
> more executors are ever added... even after Executor A times out. It ends up 
> endlessly requesting delta=0 executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15815) Hang while enable blacklistExecutor and DynamicExecutorAllocator

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15815:


Assignee: (was: Apache Spark)

> Hang while enable blacklistExecutor and DynamicExecutorAllocator 
> -
>
> Key: SPARK-15815
> URL: https://issues.apache.org/jira/browse/SPARK-15815
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.6.1
>Reporter: SuYan
>Priority: Minor
>
> Enable the executor blacklist with a timeout larger than 120s and enable 
> dynamic allocation with minExecutors = 0.
> 1. Assume there is only 1 task left, running in Executor A, and all other 
> executors have timed out.
> 2. The task fails, so it will not be scheduled on the current Executor A again 
> because of the blacklist timeout.
> 3. ExecutorAllocationManager keeps requesting targetNumExecutors = 1; since we 
> already have Executor A, oldTargetNumExecutors == targetNumExecutors = 1, so no 
> more executors are ever added... even after Executor A times out. It ends up 
> endlessly requesting delta=0 executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17197) PySpark LiR/LoR supports tree aggregation level configurable

2016-08-22 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17197:

Priority: Minor  (was: Major)

> PySpark LiR/LoR supports tree aggregation level configurable
> 
>
> Key: SPARK-17197
> URL: https://issues.apache.org/jira/browse/SPARK-17197
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> SPARK-17090 made the tree aggregation level in LiR/LoR configurable; this JIRA is 
> to make PySpark support this as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17197) PySpark LiR/LoR supports tree aggregation level configurable

2016-08-22 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-17197:
---

 Summary: PySpark LiR/LoR supports tree aggregation level 
configurable
 Key: SPARK-17197
 URL: https://issues.apache.org/jira/browse/SPARK-17197
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Yanbo Liang


SPARK-17090 made the tree aggregation level in LiR/LoR configurable; this JIRA is 
to make PySpark support this as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException

2016-08-22 Thread Egor Pahomov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432040#comment-15432040
 ] 

Egor Pahomov commented on SPARK-16334:
--

(just reason for reopen)

> [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
> -
>
> Key: SPARK-16334
> URL: https://issues.apache.org/jira/browse/SPARK-16334
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Assignee: Sameer Agarwal
>Priority: Critical
>  Labels: sql
> Fix For: 2.0.1, 2.1.0
>
>
> Query:
> {code}
> select * from blabla where user_id = 415706251
> {code}
> Error:
> {code}
> 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 
> (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934
> at 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Works on 1.6.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-16334) [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException

2016-08-22 Thread Egor Pahomov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Egor Pahomov reopened SPARK-16334:
--

Seems like a lot of people still have this problem even after the suggested fix.

> [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
> -
>
> Key: SPARK-16334
> URL: https://issues.apache.org/jira/browse/SPARK-16334
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Assignee: Sameer Agarwal
>Priority: Critical
>  Labels: sql
> Fix For: 2.0.1, 2.1.0
>
>
> Query:
> {code}
> select * from blabla where user_id = 415706251
> {code}
> Error:
> {code}
> 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 
> (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934
> at 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Works on 1.6.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17196) Cannot initialize SparkContext in a Kerberos env

2016-08-22 Thread sangshenghong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sangshenghong updated SPARK-17196:
--
Description: 
When we submit an application, we get the following exception:
java.lang.ClassNotFoundException: org.spark_project.protobuf.GeneratedMessage
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at 
com.spss.utilities.classloading.dynamicclassloader.ChildFirstDynamicClassLoader.loadClass(ChildFirstDynamicClassLoader.java:108)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at 
akka.actor.ReflectiveDynamicAccess$$anonfun$getClassFor$1.apply(DynamicAccess.scala:67)
at 
akka.actor.ReflectiveDynamicAccess$$anonfun$getClassFor$1.apply(DynamicAccess.scala:66)
at scala.util.Try$.apply(Try.scala:161)
at 
akka.actor.ReflectiveDynamicAccess.getClassFor(DynamicAccess.scala:66)
at 
akka.serialization.Serialization$$anonfun$6.apply(Serialization.scala:181)
at 
akka.serialization.Serialization$$anonfun$6.apply(Serialization.scala:181)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at 
scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
at 
scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at 
scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at akka.serialization.Serialization.(Serialization.scala:181)
at 
akka.serialization.SerializationExtension$.createExtension(SerializationExtension.scala:15)
at 
akka.serialization.SerializationExtension$.createExtension(SerializationExtension.scala:12)
at akka.actor.ActorSystemImpl.registerExtension(ActorSystem.scala:713)
at akka.actor.ExtensionId$class.apply(Extension.scala:79)
at 
akka.serialization.SerializationExtension$.apply(SerializationExtension.scala:12)
at 
akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:175)
at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:620)
at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:617)
at akka.actor.ActorSystemImpl._start(ActorSystem.scala:617)
at akka.actor.ActorSystemImpl.start(ActorSystem.scala:634)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:142)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:119)
at 
org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:52)
at 
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1920)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1911)
at 
org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:55)
at 
org.apache.spark.rpc.akka.AkkaRpcEnvFactory.create(AkkaRpcEnv.scala:253)
at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:53)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:254)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:194)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
at org.apache.spark.SparkContext.(SparkContext.scala:450)
at 
org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:61)
at 
org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:75)


I also checked the Spark assembly jar file and could not find the package 
org.spark_project, only org\spark-project\. In version 1.3.1 the package 
"org.spark_project" does exist.

  was:
When we submit an application, we get the following exception:
java.lang.ClassNotFoundException: org.spark_project.protobuf.GeneratedMessage
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)

[jira] [Created] (SPARK-17196) Cannot initialize SparkContext in a Kerberos env

2016-08-22 Thread sangshenghong (JIRA)
sangshenghong created SPARK-17196:
-

 Summary: Cannot initialize SparkContext in a Kerberos env
 Key: SPARK-17196
 URL: https://issues.apache.org/jira/browse/SPARK-17196
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.2
 Environment: HDP 2.3.4(Spark 1.5.2)+Kerberos
Reporter: sangshenghong


When we submit an application, we get the following exception:
java.lang.ClassNotFoundException: org.spark_project.protobuf.GeneratedMessage
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at 
com.spss.utilities.classloading.dynamicclassloader.ChildFirstDynamicClassLoader.loadClass(ChildFirstDynamicClassLoader.java:108)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at 
akka.actor.ReflectiveDynamicAccess$$anonfun$getClassFor$1.apply(DynamicAccess.scala:67)
at 
akka.actor.ReflectiveDynamicAccess$$anonfun$getClassFor$1.apply(DynamicAccess.scala:66)
at scala.util.Try$.apply(Try.scala:161)
at 
akka.actor.ReflectiveDynamicAccess.getClassFor(DynamicAccess.scala:66)
at 
akka.serialization.Serialization$$anonfun$6.apply(Serialization.scala:181)
at 
akka.serialization.Serialization$$anonfun$6.apply(Serialization.scala:181)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at 
scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
at 
scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at 
scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at akka.serialization.Serialization.(Serialization.scala:181)
at 
akka.serialization.SerializationExtension$.createExtension(SerializationExtension.scala:15)
at 
akka.serialization.SerializationExtension$.createExtension(SerializationExtension.scala:12)
at akka.actor.ActorSystemImpl.registerExtension(ActorSystem.scala:713)
at akka.actor.ExtensionId$class.apply(Extension.scala:79)
at 
akka.serialization.SerializationExtension$.apply(SerializationExtension.scala:12)
at 
akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:175)
at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:620)
at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:617)
at akka.actor.ActorSystemImpl._start(ActorSystem.scala:617)
at akka.actor.ActorSystemImpl.start(ActorSystem.scala:634)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:142)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:119)
at 
org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:52)
at 
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1920)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1911)
at 
org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:55)
at 
org.apache.spark.rpc.akka.AkkaRpcEnvFactory.create(AkkaRpcEnv.scala:253)
at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:53)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:254)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:194)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
at org.apache.spark.SparkContext.(SparkContext.scala:450)
at 
org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:61)
at 
org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17157) Add multiclass logistic regression SparkR Wrapper

2016-08-22 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-17157:
---
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-16442

> Add multiclass logistic regression SparkR Wrapper
> -
>
> Key: SPARK-17157
> URL: https://issues.apache.org/jira/browse/SPARK-17157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Miao Wang
>
> [SPARK-7159][ML] Add multiclass logistic regression to Spark ML has been 
> merged to master. I am opening this JIRA to discuss adding a SparkR wrapper 
> for multiclass logistic regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17174) Provide support for Timestamp type Column in add_months function to return HH:mm:ss

2016-08-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431965#comment-15431965
 ] 

Hyukjin Kwon commented on SPARK-17174:
--

I just took a look at what others do, for reference.

It seems Hive is also doing this, 
https://github.com/apache/hive/blob/26b5c7b56a4f28ce3eabc0207566cce46b29b558/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAddMonths.java#L48-L51

Oracle's also returns date types, 
https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions004.htm

It seems the input timestamp type is being converted into date types as below:

{code}
Seq(Tuple1(Timestamp.valueOf("2012-07-16 12:12:12"))).toDF("ts")
  .selectExpr("add_months(ts, 1)", "date_add(ts, 1)")
  .show()
{code}

prints as below:

{code}
+-------------------------------+-----------------------------+
|add_months(CAST(ts AS DATE), 1)|date_add(CAST(ts AS DATE), 1)|
+-------------------------------+-----------------------------+
|                     2012-08-16|                   2012-07-17|
+-------------------------------+-----------------------------+
{code}

It seems there is a discussion about this here, 
https://github.com/apache/spark/pull/7589#discussion_r35186500

So, I believe it'd make sense to document this behaviour in the expression 
description, like Hive does: 
https://github.com/apache/hive/blob/26b5c7b56a4f28ce3eabc0207566cce46b29b558/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAddMonths.java#L48-L51

Do you mind if I submit a PR documenting this?


> Provide support for Timestamp type Column in add_months function to return 
> HH:mm:ss
> ---
>
> Key: SPARK-17174
> URL: https://issues.apache.org/jira/browse/SPARK-17174
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Amit Baghel
>Priority: Minor
>
> The add_months function currently supports Date types. If the Column is of Timestamp 
> type, it adds the month to the date but doesn't return the timestamp part 
> (HH:mm:ss). See the code below.
> {code}
> import java.util.Calendar
> val now = Calendar.getInstance().getTime()
> val df = sc.parallelize((0 to 3).map(i => {now.setMonth(i); (i, new 
> java.sql.Timestamp(now.getTime))}).toSeq).toDF("ID", "DateWithTS")
> df.withColumn("NewDateWithTS", add_months(df("DateWithTS"),1)).show
> {code}
> The above code gives the following response. Note that the HH:mm:ss part is 
> missing from the NewDateWithTS column.
> {code}
> +---+--------------------+-------------+
> | ID|          DateWithTS|NewDateWithTS|
> +---+--------------------+-------------+
> |  0|2016-01-21 09:38:...|   2016-02-21|
> |  1|2016-02-21 09:38:...|   2016-03-21|
> |  2|2016-03-21 09:38:...|   2016-04-21|
> |  3|2016-04-21 09:38:...|   2016-05-21|
> +---+--------------------+-------------+
> {code}
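
Until the behaviour changes (or is documented), a possible workaround sketch (assuming a DataFrame {{df}} as in the description) is to add the months on the date part and then re-attach the original time-of-day:

{code}
import org.apache.spark.sql.functions._

// Sketch only: seconds-since-midnight of the original timestamp, re-added after
// add_months (ignores DST edge cases).
val secondsOfDay =
  unix_timestamp(col("DateWithTS")) - unix_timestamp(col("DateWithTS").cast("date"))
val result = df.withColumn(
  "NewDateWithTS",
  (unix_timestamp(add_months(col("DateWithTS"), 1)) + secondsOfDay).cast("timestamp"))
{code}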



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17195) Dealing with JDBC column nullability when it is not reliable

2016-08-22 Thread Jason Moore (JIRA)
Jason Moore created SPARK-17195:
---

 Summary: Dealing with JDBC column nullability when it is not 
reliable
 Key: SPARK-17195
 URL: https://issues.apache.org/jira/browse/SPARK-17195
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Jason Moore


Starting with Spark 2.0.0, the column "nullable" property is important to have 
correct for the code generation to work properly.  Marking the column as 
nullable = false used to (<2.0.0) allow null values to be operated on, but now 
this will result in:

{noformat}
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
{noformat}

I'm all for the change towards a more rigid behavior (enforcing correct 
input).  But the problem I'm facing now is that when I used JDBC to read from a 
Teradata server, the column nullability is often not correct (particularly when 
sub-queries are involved).

This is the line in question:
https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L140

I'm trying to work out what would be the way forward for me on this.  I know 
that it's really the fault of the Teradata database server not returning the 
correct schema, but I'll need to make Spark itself or my application resilient 
to this behavior.

One of the Teradata JDBC Driver tech leads has told me that "when the 
rsmd.getSchemaName and rsmd.getTableName methods return an empty zero-length 
string, then the other metadata values may not be completely accurate" - so one 
option could be to treat the nullability (at least) the same way as the 
"unknown" case (as nullable = true).  For reference, see the rest of our 
discussion here: 
http://forums.teradata.com/forum/connectivity/teradata-jdbc-driver-returns-the-wrong-schema-column-nullability
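
For illustration, one possible application-side workaround (a rough sketch only; 
{{jdbcUrl}}, {{someTable}} and {{connectionProperties}} are placeholders, and this 
does not change the JDBC data source itself) would be to re-create the DataFrame 
with every field forced to nullable = true:

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}

// Sketch: relax the schema reported by the driver so that unexpected NULLs
// coming back from Teradata do not break the generated code.
def forceNullable(df: DataFrame): DataFrame = {
  val relaxed = StructType(df.schema.map {
    case StructField(name, dataType, _, metadata) =>
      StructField(name, dataType, nullable = true, metadata)
  })
  df.sparkSession.createDataFrame(df.rdd, relaxed)
}

val raw  = spark.read.jdbc(jdbcUrl, someTable, connectionProperties)
val safe = forceNullable(raw)
{code}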

Any other thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17182) CollectList and CollectSet should be marked as non-deterministic

2016-08-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17182.
-
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

> CollectList and CollectSet should be marked as non-deterministic
> 
>
> Key: SPARK-17182
> URL: https://issues.apache.org/jira/browse/SPARK-17182
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.1, 2.1.0
>
>
> {{CollectList}} and {{CollectSet}} should be marked as non-deterministic 
> since their results depend on the actual order of input rows.
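
For context, a minimal illustration of the order dependence (a sketch only, run 
in the shell; the exact ordering inside the collected list can differ from run to 
run):

{code}
import org.apache.spark.sql.functions.collect_list

// Sketch: the contents of the collected list depend on the physical row order,
// which may change after a shuffle/repartition, so the aggregate must not be
// treated as deterministic by the optimizer.
val df = spark.range(0, 100).toDF("x")
df.repartition(8).agg(collect_list("x")).show(false)
{code}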



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17184) Replace ByteBuf with InputStream

2016-08-22 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431903#comment-15431903
 ] 

Guoqiang Li commented on SPARK-17184:
-

OK, I'll post the detailed design document later.

> Replace ByteBuf with InputStream
> 
>
> Key: SPARK-17184
> URL: https://issues.apache.org/jira/browse/SPARK-17184
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Guoqiang Li
>
> The size of a ByteBuf cannot be greater than 2 GB, so it should be replaced by 
> an InputStream.
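
For illustration, one way around the 2 GB limit (a rough sketch only, separate 
from the design document mentioned above) is to expose a large payload as a 
single InputStream over several smaller ByteBuf chunks:

{code}
import java.io.{InputStream, SequenceInputStream}
import scala.collection.JavaConverters._
import io.netty.buffer.{ByteBuf, ByteBufInputStream}

// Sketch: each chunk stays well below Int.MaxValue bytes, but consumers read
// the whole payload through one InputStream and never need a single huge ByteBuf.
def chunksAsStream(chunks: Seq[ByteBuf]): InputStream =
  new SequenceInputStream(
    chunks.iterator.map(c => new ByteBufInputStream(c): InputStream).asJavaEnumeration)
{code}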



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17194) When emitting SQL for string literals Spark should use single quotes, not double

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17194:


Assignee: Apache Spark  (was: Josh Rosen)

> When emitting SQL for string literals Spark should use single quotes, not 
> double
> 
>
> Key: SPARK-17194
> URL: https://issues.apache.org/jira/browse/SPARK-17194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Minor
>
> When Spark emits SQL for a string literal, it should wrap the string in 
> single quotes, not double quotes. Databases which adhere more strictly to the 
> ANSI SQL standards, such as Postgres, allow only single-quotes to be used for 
> denoting string literals (see http://stackoverflow.com/a/1992331/590203).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17194) When emitting SQL for string literals Spark should use single quotes, not double

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17194:


Assignee: Josh Rosen  (was: Apache Spark)

> When emitting SQL for string literals Spark should use single quotes, not 
> double
> 
>
> Key: SPARK-17194
> URL: https://issues.apache.org/jira/browse/SPARK-17194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> When Spark emits SQL for a string literal, it should wrap the string in 
> single quotes, not double quotes. Databases which adhere more strictly to the 
> ANSI SQL standards, such as Postgres, allow only single-quotes to be used for 
> denoting string literals (see http://stackoverflow.com/a/1992331/590203).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17194) When emitting SQL for string literals Spark should use single quotes, not double

2016-08-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431873#comment-15431873
 ] 

Apache Spark commented on SPARK-17194:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14763

> When emitting SQL for string literals Spark should use single quotes, not 
> double
> 
>
> Key: SPARK-17194
> URL: https://issues.apache.org/jira/browse/SPARK-17194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> When Spark emits SQL for a string literal, it should wrap the string in 
> single quotes, not double quotes. Databases which adhere more strictly to the 
> ANSI SQL standards, such as Postgres, allow only single-quotes to be used for 
> denoting string literals (see http://stackoverflow.com/a/1992331/590203).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17194) When emitting SQL for string literals Spark should use single quotes, not double

2016-08-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-17194:
--

Assignee: Josh Rosen

> When emitting SQL for string literals Spark should use single quotes, not 
> double
> 
>
> Key: SPARK-17194
> URL: https://issues.apache.org/jira/browse/SPARK-17194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> When Spark emits SQL for a string literal, it should wrap the string in 
> single quotes, not double quotes. Databases which adhere more strictly to the 
> ANSI SQL standards, such as Postgres, allow only single-quotes to be used for 
> denoting string literals (see http://stackoverflow.com/a/1992331/590203).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17194) When emitting SQL for string literals, Spark should use single quotes, not double

2016-08-22 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-17194:
--

 Summary: When emitting SQL for string literals, Spark should use 
single quotes, not double
 Key: SPARK-17194
 URL: https://issues.apache.org/jira/browse/SPARK-17194
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Priority: Minor


When Spark emits SQL for a string literal, it should wrap the string in single 
quotes, not double quotes. Databases which adhere more strictly to the ANSI SQL 
standards, such as Postgres, allow only single-quotes to be used for denoting 
string literals (see http://stackoverflow.com/a/1992331/590203).
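
For illustration, a minimal quoting helper (a sketch only, not the actual 
SQLBuilder change) that wraps a value in single quotes and escapes embedded 
single quotes by doubling them, as ANSI SQL expects:

{code}
// Sketch: ANSI-style string literal quoting. Embedded single quotes are
// escaped by doubling them ('' is the standard escape sequence).
def quoteStringLiteral(value: String): String =
  "'" + value.replace("'", "''") + "'"

quoteStringLiteral("it's a test")  // returns: 'it''s a test'
{code}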



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17194) When emitting SQL for string literals Spark should use single quotes, not double

2016-08-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17194:
---
Summary: When emitting SQL for string literals Spark should use single 
quotes, not double  (was: When emitting SQL for string literals, Spark should 
use single quotes, not double)

> When emitting SQL for string literals Spark should use single quotes, not 
> double
> 
>
> Key: SPARK-17194
> URL: https://issues.apache.org/jira/browse/SPARK-17194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Priority: Minor
>
> When Spark emits SQL for a string literal, it should wrap the string in 
> single quotes, not double quotes. Databases which adhere more strictly to the 
> ANSI SQL standards, such as Postgres, allow only single-quotes to be used for 
> denoting string literals (see http://stackoverflow.com/a/1992331/590203).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16577) Add check-cran script to Jenkins

2016-08-22 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-16577.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14759
[https://github.com/apache/spark/pull/14759]

> Add check-cran script to Jenkins
> 
>
> Key: SPARK-16577
> URL: https://issues.apache.org/jira/browse/SPARK-16577
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
> Fix For: 2.0.1, 2.1.0
>
>
> After we have fixed the warnings from the CRAN checks we should add this as a 
> part of the Jenkins build.
> This depends on SPARK-16507 and SPARK-16508 being resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17144) Removal of useless CreateHiveTableAsSelectLogicalPlan

2016-08-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17144.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14707
[https://github.com/apache/spark/pull/14707]

> Removal of useless CreateHiveTableAsSelectLogicalPlan
> -
>
> Key: SPARK-17144
> URL: https://issues.apache.org/jira/browse/SPARK-17144
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> {{CreateHiveTableAsSelectLogicalPlan}} is dead code after refactoring.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17144) Removal of useless CreateHiveTableAsSelectLogicalPlan

2016-08-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17144:

Assignee: Xiao Li

> Removal of useless CreateHiveTableAsSelectLogicalPlan
> -
>
> Key: SPARK-17144
> URL: https://issues.apache.org/jira/browse/SPARK-17144
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> {{CreateHiveTableAsSelectLogicalPlan}} is dead code after refactoring.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15767) Decision Tree Regression wrapper in SparkR

2016-08-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15767:
--
Shepherd:   (was: Joseph K. Bradley)

> Decision Tree Regression wrapper in SparkR
> --
>
> Key: SPARK-15767
> URL: https://issues.apache.org/jira/browse/SPARK-15767
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Kai Jiang
>Assignee: Kai Jiang
>
> Implement a wrapper in SparkR to support decision tree regression. R's naive 
> Decision Tree Regression implementation comes from the rpart package, with the 
> signature rpart(formula, dataframe, method="anova"). I propose we implement an 
> API like spark.rpart(dataframe, formula, ...). After decision tree 
> classification has been implemented, we could refactor these two into an API 
> more like rpart().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16550) Caching data with replication doesn't replicate data

2016-08-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16550.
-
   Resolution: Fixed
 Assignee: Eric Liang  (was: Josh Rosen)
Fix Version/s: 2.1.0
   2.0.1

> Caching data with replication doesn't replicate data
> 
>
> Key: SPARK-16550
> URL: https://issues.apache.org/jira/browse/SPARK-16550
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.0.0
>Reporter: Shubham Chopra
>Assignee: Eric Liang
> Fix For: 2.0.1, 2.1.0
>
>
> Caching multiple replicas of blocks is currently broken. The following 
> examples show replication doesn't happen for various use-cases:
> These were run using Spark 2.0.0-preview, in local-cluster[2,1,1024] mode
> {noformat}
> case class TestInteger(i: Int)
> val data = sc.parallelize((1 to 1000).map(TestInteger(_)), 
> 10).persist(MEMORY_ONLY_2)
> data.count
> {noformat}
> sc.getExecutorStorageStatus.map(s => s.rddBlocksById(data.id).size).sum shows 
> only 10 blocks as opposed to the expected 20
> Block replication fails on the executors with a java.lang.RuntimeException: 
> java.lang.ClassNotFoundException: $line14.$read$$iw$$iw$TestInteger
> {noformat}
> val data1 = sc.parallelize(1 to 1000, 
> 10).persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_2)
> data1.count
> Block replication again fails with the following errors:
> 16/07/14 14:50:40 ERROR TransportRequestHandler: Error while invoking 
> RpcHandler#receive() on RPC id 8567643992794608648
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 13994
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:213)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutBytes$1.apply(BlockManager.scala:775)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutBytes$1.apply(BlockManager.scala:753)
> {noformat}
> sc.getExecutorStorageStatus.map(s => s.rddBlocksById(data1.id).size).sum 
> again shows 10 blocks
> Caching serialized data works for native types, but not for custom classes
> {noformat}
> val data3 = sc.parallelize(1 to 1000, 10).persist(MEMORY_ONLY_SER_2)
> data3.count
> {noformat}
> works as intended.
> But 
> {noformat}
> val data4 = sc.parallelize((1 to 1000).map(TestInteger(_)), 
> 10).persist(MEMORY_ONLY_SER_2)
> data4.count
> {noformat}
> Again doesn't replicate data and executors show the same 
> ClassNotFoundException
> These examples worked fine and showed expected results with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17042) Repl-defined classes cannot be replicated

2016-08-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17042.
-
   Resolution: Fixed
 Assignee: Eric Liang
Fix Version/s: 2.1.0
   2.0.1

> Repl-defined classes cannot be replicated
> -
>
> Key: SPARK-17042
> URL: https://issues.apache.org/jira/browse/SPARK-17042
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.0.1, 2.1.0
>
>
> A simple fix is to erase the classTag when using the default serializer, 
> since it's not needed in that case, and the classTag was failing to 
> deserialize on the remote end.
> The proper fix is actually to use the right classloader when deserializing 
> the classtags, but that is a much more invasive change for 2.0.
> The following test can be added to ReplSuite to reproduce the bug:
> {code}
>   test("replicating blocks of object with class defined in repl") {
> val output = runInterpreter("local-cluster[2,1,1024]",
>   """
> |import org.apache.spark.storage.StorageLevel._
> |case class Foo(i: Int)
> |val ret = sc.parallelize((1 to 100).map(Foo), 
> 10).persist(MEMORY_ONLY_2)
> |ret.count()
> |sc.getExecutorStorageStatus.map(s => 
> s.rddBlocksById(ret.id).size).sum
>   """.stripMargin)
> assertDoesNotContain("error:", output)
> assertDoesNotContain("Exception", output)
> assertContains(": Int = 20", output)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16566) Bug in SparseMatrix multiplication with SparseVector

2016-08-22 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-16566.
-
Resolution: Duplicate

Linking existing JIRA which this one is duplicating.  Could you please work 
under the other JIRA instead of this one?  Thanks!

> Bug in SparseMatrix multiplication with SparseVector
> 
>
> Key: SPARK-16566
> URL: https://issues.apache.org/jira/browse/SPARK-16566
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.2
>Reporter: Wilson
>
> In org.apache.spark.mllib.linalg.BLAS.scala, the multiplication between a 
> SparseMatrix (sm) and a SparseVector (sv) when sm is not transposed assumes 
> that the indices are sorted, but there is no validation to make sure that is 
> the case, so the returned result is wrong.
> This can be replicated simply by using spark-shell and entering these 
> commands:
> import org.apache.spark.mllib.linalg.SparseMatrix
> import org.apache.spark.mllib.linalg.SparseVector
> import org.apache.spark.mllib.linalg.DenseVector
> import scala.collection.mutable.ArrayBuffer
> val vectorIndices = Array(3,2)
> val vectorValues = Array(0.1,0.2)
> val size = 4
> val sm = new SparseMatrix(size, size, Array(0, 0, 0, 1, 1), Array(0), 
> Array(1.0))
> val dm = sm.toDense
> val sv = new SparseVector(size, vectorIndices, vectorValues)
> val dv = new DenseVector(sv.toArray)
> sm.multiply(dv) == sm.multiply(sv)
> sm.multiply(dv)
> sm.multiply(sv)
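
For anyone hitting this in the meantime, a minimal application-side guard (a 
sketch only; it simply rebuilds the vector with sorted indices before 
multiplying):

{code}
import org.apache.spark.mllib.linalg.SparseVector

// Sketch: return an equivalent SparseVector whose indices are in ascending
// order, which is the precondition the BLAS multiplication path relies on.
def withSortedIndices(sv: SparseVector): SparseVector = {
  val sorted = sv.indices.zip(sv.values).sortBy(_._1)
  new SparseVector(sv.size, sorted.map(_._1), sorted.map(_._2))
}
{code}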



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16963) Change Source API so that sources do not need to keep unbounded state

2016-08-22 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431779#comment-15431779
 ] 

Frederick Reiss commented on SPARK-16963:
-

The proposed changes in the attached PR are now ready for review. [~marmbrus] 
can you please have a look at your convenience? [~prashant_] can you also 
please have a look with a particular focus on whether the changes fit with the 
MQTT connector?

> Change Source API so that sources do not need to keep unbounded state
> -
>
> Key: SPARK-16963
> URL: https://issues.apache.org/jira/browse/SPARK-16963
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Frederick Reiss
>
> The version of the Source API in Spark 2.0.0 defines a single getBatch() 
> method for fetching records from the source, with the following Scaladoc 
> comments defining the semantics:
> {noformat}
> /**
>  * Returns the data that is between the offsets (`start`, `end`]. When 
> `start` is `None` then
>  * the batch should begin with the first available record. This method must 
> always return the
>  * same data for a particular `start` and `end` pair.
>  */
> def getBatch(start: Option[Offset], end: Offset): DataFrame
> {noformat}
> These semantics mean that a Source must retain all past history for the 
> stream that it backs. Further, a Source is also required to retain this data 
> across restarts of the process where the Source is instantiated, even when 
> the Source is restarted on a different machine.
> These restrictions make it difficult to implement the Source API, as any 
> implementation requires potentially unbounded amounts of distributed storage.
> See the mailing list thread at 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-td18551.html]
>  for more information.
> This JIRA will cover augmenting the Source API with an additional callback 
> that will allow Structured Streaming scheduler to notify the source when it 
> is safe to discard buffered data.
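
For illustration, one possible shape of such a callback (a sketch only; the trait 
name, method names and signatures are placeholders, not the final API):

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Offset

// Sketch of an augmented source interface: the scheduler would call commit()
// once every batch up to `end` has been durably recorded, so the source may
// drop buffered data at or before that offset.
trait BoundedStateSource {
  def getOffset: Option[Offset]
  def getBatch(start: Option[Offset], end: Offset): DataFrame
  def commit(end: Offset): Unit = {}  // proposed callback; default is a no-op
  def stop(): Unit
}
{code}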



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16962) Unsafe accesses (Platform.getLong()) not supported on unaligned boundaries in SPARC/Solaris

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16962:


Assignee: Apache Spark

> Unsafe accesses (Platform.getLong()) not supported on unaligned boundaries in 
> SPARC/Solaris
> ---
>
> Key: SPARK-16962
> URL: https://issues.apache.org/jira/browse/SPARK-16962
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: SPARC/Solaris
>Reporter: Suman Somasundar
>Assignee: Apache Spark
>
> Unaligned accesses are not supported on SPARC architecture. Because of this, 
> Spark applications fail by dumping core on SPARC machines whenever unaligned 
> accesses happen. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16962) Unsafe accesses (Platform.getLong()) not supported on unaligned boundaries in SPARC/Solaris

2016-08-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431762#comment-15431762
 ] 

Apache Spark commented on SPARK-16962:
--

User 'sumansomasundar' has created a pull request for this issue:
https://github.com/apache/spark/pull/14762

> Unsafe accesses (Platform.getLong()) not supported on unaligned boundaries in 
> SPARC/Solaris
> ---
>
> Key: SPARK-16962
> URL: https://issues.apache.org/jira/browse/SPARK-16962
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: SPARC/Solaris
>Reporter: Suman Somasundar
>
> Unaligned accesses are not supported on SPARC architecture. Because of this, 
> Spark applications fail by dumping core on SPARC machines whenever unaligned 
> accesses happen. 
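
For context, one common mitigation pattern (a rough sketch only, not necessarily 
what the pull request above does; byte order is assumed little-endian purely for 
illustration, and a real fix must honour the platform's native byte order) is to 
fall back to byte-by-byte reads when the address is not naturally aligned:

{code}
import org.apache.spark.unsafe.Platform

// Sketch: if `offset` is not 8-byte aligned, assemble the long one byte at a
// time instead of issuing a single unaligned load that faults on SPARC.
def getLongSafely(base: AnyRef, offset: Long): Long = {
  if ((offset & 0x7L) == 0L) {
    Platform.getLong(base, offset)
  } else {
    var result = 0L
    var i = 0
    while (i < 8) {
      result |= (Platform.getByte(base, offset + i) & 0xffL) << (8 * i)
      i += 1
    }
    result
  }
}
{code}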



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16962) Unsafe accesses (Platform.getLong()) not supported on unaligned boundaries in SPARC/Solaris

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16962:


Assignee: (was: Apache Spark)

> Unsafe accesses (Platform.getLong()) not supported on unaligned boundaries in 
> SPARC/Solaris
> ---
>
> Key: SPARK-16962
> URL: https://issues.apache.org/jira/browse/SPARK-16962
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: SPARC/Solaris
>Reporter: Suman Somasundar
>
> Unaligned accesses are not supported on SPARC architecture. Because of this, 
> Spark applications fail by dumping core on SPARC machines whenever unaligned 
> accesses happen. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17162) Range does not support SQL generation

2016-08-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17162.
-
   Resolution: Fixed
 Assignee: Eric Liang
Fix Version/s: 2.1.0
   2.0.1

> Range does not support SQL generation
> -
>
> Key: SPARK-17162
> URL: https://issues.apache.org/jira/browse/SPARK-17162
> Project: Spark
>  Issue Type: Bug
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> {code}
> scala> sql("create view a as select * from range(100)")
> 16/08/19 21:10:29 INFO SparkSqlParser: Parsing command: create view a as 
> select * from range(100)
> java.lang.UnsupportedOperationException: unsupported plan Range (0, 100, 
> splits=8)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:212)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127)
>   at org.apache.spark.sql.catalyst.SQLBuilder.toSQL(SQLBuilder.scala:97)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:174)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:138)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17099) Incorrect result when HAVING clause is added to group by query

2016-08-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17099:
---
Target Version/s: 2.0.1, 2.1.0
   Fix Version/s: (was: 2.1.0)

> Incorrect result when HAVING clause is added to group by query
> --
>
> Key: SPARK-17099
> URL: https://issues.apache.org/jira/browse/SPARK-17099
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Critical
>
> Random query generation uncovered the following query which returns incorrect 
> results when run on Spark SQL. This wasn't the original query uncovered by 
> the generator, since I performed a bit of minimization to try to make it more 
> understandable.
> With the following tables:
> {code}
> val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5")
> val t2 = sc.parallelize(
>   Seq(
> (-769, -244),
> (-800, -409),
> (940, 86),
> (-507, 304),
> (-367, 158))
> ).toDF("int_col_2", "int_col_5")
> t1.registerTempTable("t1")
> t2.registerTempTable("t2")
> {code}
> Run
> {code}
> SELECT
>   (SUM(COALESCE(t1.int_col_5, t2.int_col_2))),
>  ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
> FROM t1
> RIGHT JOIN t2
>   ON (t2.int_col_2) = (t1.int_col_5)
> GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
>  COALESCE(t1.int_col_5, t2.int_col_2)
> HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, 
> t2.int_col_2)) * 2)
> {code}
> In Spark SQL, this returns an empty result set, whereas Postgres returns four 
> rows. However, if I omit the {{HAVING}} clause I can see the groups' rows, 
> which shows that the {{HAVING}} clause is filtering them incorrectly:
> {code}
> +-------------------------------------+--------------------------------------+
> | sum(coalesce(int_col_5, int_col_2)) | (coalesce(int_col_5, int_col_2) * 2) |
> +-------------------------------------+--------------------------------------+
> | -507                                | -1014                                |
> | 940                                 | 1880                                 |
> | -769                                | -1538                                |
> | -367                                | -734                                 |
> | -800                                | -1600                                |
> +-------------------------------------+--------------------------------------+
> {code}
> Based on this, the output after adding the {{HAVING}} should contain four 
> rows, not zero.
> I'm not sure how to further shrink this in a straightforward way, so I'm 
> opening this bug to get help in triaging further.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17042) Repl-defined classes cannot be replicated

2016-08-22 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431616#comment-15431616
 ] 

Eric Liang commented on SPARK-17042:


Yeah, my bad. I was trying to split this up but it turns out to be unnecessary.

> Repl-defined classes cannot be replicated
> -
>
> Key: SPARK-17042
> URL: https://issues.apache.org/jira/browse/SPARK-17042
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Reporter: Eric Liang
>
> A simple fix is to erase the classTag when using the default serializer, 
> since it's not needed in that case, and the classTag was failing to 
> deserialize on the remote end.
> The proper fix is actually to use the right classloader when deserializing 
> the classtags, but that is a much more invasive change for 2.0.
> The following test can be added to ReplSuite to reproduce the bug:
> {code}
>   test("replicating blocks of object with class defined in repl") {
> val output = runInterpreter("local-cluster[2,1,1024]",
>   """
> |import org.apache.spark.storage.StorageLevel._
> |case class Foo(i: Int)
> |val ret = sc.parallelize((1 to 100).map(Foo), 
> 10).persist(MEMORY_ONLY_2)
> |ret.count()
> |sc.getExecutorStorageStatus.map(s => 
> s.rddBlocksById(ret.id).size).sum
>   """.stripMargin)
> assertDoesNotContain("error:", output)
> assertDoesNotContain("Exception", output)
> assertContains(": Int = 20", output)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17042) Repl-defined classes cannot be replicated

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17042:


Assignee: Apache Spark

> Repl-defined classes cannot be replicated
> -
>
> Key: SPARK-17042
> URL: https://issues.apache.org/jira/browse/SPARK-17042
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Reporter: Eric Liang
>Assignee: Apache Spark
>
> A simple fix is to erase the classTag when using the default serializer, 
> since it's not needed in that case, and the classTag was failing to 
> deserialize on the remote end.
> The proper fix is actually to use the right classloader when deserializing 
> the classtags, but that is a much more invasive change for 2.0.
> The following test can be added to ReplSuite to reproduce the bug:
> {code}
>   test("replicating blocks of object with class defined in repl") {
> val output = runInterpreter("local-cluster[2,1,1024]",
>   """
> |import org.apache.spark.storage.StorageLevel._
> |case class Foo(i: Int)
> |val ret = sc.parallelize((1 to 100).map(Foo), 
> 10).persist(MEMORY_ONLY_2)
> |ret.count()
> |sc.getExecutorStorageStatus.map(s => 
> s.rddBlocksById(ret.id).size).sum
>   """.stripMargin)
> assertDoesNotContain("error:", output)
> assertDoesNotContain("Exception", output)
> assertContains(": Int = 20", output)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17042) Repl-defined classes cannot be replicated

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17042:


Assignee: (was: Apache Spark)

> Repl-defined classes cannot be replicated
> -
>
> Key: SPARK-17042
> URL: https://issues.apache.org/jira/browse/SPARK-17042
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Reporter: Eric Liang
>
> A simple fix is to erase the classTag when using the default serializer, 
> since it's not needed in that case, and the classTag was failing to 
> deserialize on the remote end.
> The proper fix is actually to use the right classloader when deserializing 
> the classtags, but that is a much more invasive change for 2.0.
> The following test can be added to ReplSuite to reproduce the bug:
> {code}
>   test("replicating blocks of object with class defined in repl") {
> val output = runInterpreter("local-cluster[2,1,1024]",
>   """
> |import org.apache.spark.storage.StorageLevel._
> |case class Foo(i: Int)
> |val ret = sc.parallelize((1 to 100).map(Foo), 
> 10).persist(MEMORY_ONLY_2)
> |ret.count()
> |sc.getExecutorStorageStatus.map(s => 
> s.rddBlocksById(ret.id).size).sum
>   """.stripMargin)
> assertDoesNotContain("error:", output)
> assertDoesNotContain("Exception", output)
> assertContains(": Int = 20", output)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17042) Repl-defined classes cannot be replicated

2016-08-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431614#comment-15431614
 ] 

Apache Spark commented on SPARK-17042:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/14311

> Repl-defined classes cannot be replicated
> -
>
> Key: SPARK-17042
> URL: https://issues.apache.org/jira/browse/SPARK-17042
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Reporter: Eric Liang
>
> A simple fix is to erase the classTag when using the default serializer, 
> since it's not needed in that case, and the classTag was failing to 
> deserialize on the remote end.
> The proper fix is actually to use the right classloader when deserializing 
> the classtags, but that is a much more invasive change for 2.0.
> The following test can be added to ReplSuite to reproduce the bug:
> {code}
>   test("replicating blocks of object with class defined in repl") {
> val output = runInterpreter("local-cluster[2,1,1024]",
>   """
> |import org.apache.spark.storage.StorageLevel._
> |case class Foo(i: Int)
> |val ret = sc.parallelize((1 to 100).map(Foo), 
> 10).persist(MEMORY_ONLY_2)
> |ret.count()
> |sc.getExecutorStorageStatus.map(s => 
> s.rddBlocksById(ret.id).size).sum
>   """.stripMargin)
> assertDoesNotContain("error:", output)
> assertDoesNotContain("Exception", output)
> assertContains(": Int = 20", output)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17193) HadoopRDD NPE at DEBUG log level when getLocationInfo == null

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17193:


Assignee: Apache Spark  (was: Sean Owen)

> HadoopRDD NPE at DEBUG log level when getLocationInfo == null
> -
>
> Key: SPARK-17193
> URL: https://issues.apache.org/jira/browse/SPARK-17193
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Trivial
>
> When I set the log level to "DEBUG" in one of my apps that reads from 
> Parquet, I notice several NullPointerExceptions logged from 
> HadoopRDD.getPreferredLocations. 
> It doesn't affect executions as it just results in "no preferred locations". 
> It happens when InputSplitWithLocationInfo.getLocationInfo produces null, 
> which it may. The code just dereferences it however. 
> It's cleaner to check this directly (and maybe tighten up the code slightly) 
> and avoid polluting the log, though, it's just at debug level. No big deal, 
> but enough of an annoyance when I was debugging something that it's probably 
> worth zapping.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17193) HadoopRDD NPE at DEBUG log level when getLocationInfo == null

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17193:


Assignee: Sean Owen  (was: Apache Spark)

> HadoopRDD NPE at DEBUG log level when getLocationInfo == null
> -
>
> Key: SPARK-17193
> URL: https://issues.apache.org/jira/browse/SPARK-17193
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Trivial
>
> When I set the log level to "DEBUG" in one of my apps that reads from 
> Parquet, I notice several NullPointerExceptions logged from 
> HadoopRDD.getPreferredLocations. 
> It doesn't affect executions as it just results in "no preferred locations". 
> It happens when InputSplitWithLocationInfo.getLocationInfo produces null, 
> which it may. The code just dereferences it however. 
> It's cleaner to check this directly (and maybe tighten up the code slightly) 
> and avoid polluting the log, though, it's just at debug level. No big deal, 
> but enough of an annoyance when I was debugging something that it's probably 
> worth zapping.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17193) HadoopRDD NPE at DEBUG log level when getLocationInfo == null

2016-08-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431585#comment-15431585
 ] 

Apache Spark commented on SPARK-17193:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14760

> HadoopRDD NPE at DEBUG log level when getLocationInfo == null
> -
>
> Key: SPARK-17193
> URL: https://issues.apache.org/jira/browse/SPARK-17193
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Trivial
>
> When I set the log level to "DEBUG" in one of my apps that reads from 
> Parquet, I notice several NullPointerExceptions logged from 
> HadoopRDD.getPreferredLocations. 
> It doesn't affect executions as it just results in "no preferred locations". 
> It happens when InputSplitWithLocationInfo.getLocationInfo produces null, 
> which it may. The code just dereferences it however. 
> It's cleaner to check this directly (and maybe tighten up the code slightly) 
> and avoid polluting the log, though, it's just at debug level. No big deal, 
> but enough of an annoyance when I was debugging something that it's probably 
> worth zapping.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17193) HadoopRDD NPE at DEBUG log level when getLocationInfo == null

2016-08-22 Thread Sean Owen (JIRA)
Sean Owen created SPARK-17193:
-

 Summary: HadoopRDD NPE at DEBUG log level when getLocationInfo == 
null
 Key: SPARK-17193
 URL: https://issues.apache.org/jira/browse/SPARK-17193
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Trivial


When I set the log level to "DEBUG" in one of my apps that reads from Parquet, 
I notice several NullPointerExceptions logged from 
HadoopRDD.getPreferredLocations. 

It doesn't affect execution, as it just results in "no preferred locations". It 
happens when InputSplitWithLocationInfo.getLocationInfo returns null, which it 
may; the code just dereferences it regardless.

It's cleaner to check for this directly (and maybe tighten up the code slightly) 
and avoid polluting the log, even though it's only at debug level. No big deal, 
but enough of an annoyance while I was debugging something that it's probably 
worth zapping.
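
For illustration, the kind of null-safe handling described above (a sketch only, 
not the actual patch):

{code}
import org.apache.hadoop.mapred.InputSplitWithLocationInfo

// Sketch: treat a null getLocationInfo() the same as "no preferred locations"
// instead of dereferencing it and logging an NPE at DEBUG level.
def preferredHosts(split: InputSplitWithLocationInfo): Seq[String] =
  Option(split.getLocationInfo).toSeq.flatten
    .map(_.getLocation)
    .filter(host => host != null && host != "localhost")
{code}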



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17192) Issuing an exception when users specify the partitioning columns without a given schema

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17192:


Assignee: (was: Apache Spark)

> Issuing an exception when users specify the partitioning columns without a 
> given schema
> ---
>
> Key: SPARK-17192
> URL: https://issues.apache.org/jira/browse/SPARK-17192
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> We need to issue an exception when users specify the partitioning columns 
> without a given schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17192) Issuing an exception when users specify the partitioning columns without a given schema

2016-08-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431561#comment-15431561
 ] 

Apache Spark commented on SPARK-17192:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14572

> Issuing an exception when users specify the partitioning columns without a 
> given schema
> ---
>
> Key: SPARK-17192
> URL: https://issues.apache.org/jira/browse/SPARK-17192
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> We need to issue an exception when users specify the partitioning columns 
> without a given schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17192) Issuing an exception when users specify the partitioning columns without a given schema

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17192:


Assignee: Apache Spark

> Issuing an exception when users specify the partitioning columns without a 
> given schema
> ---
>
> Key: SPARK-17192
> URL: https://issues.apache.org/jira/browse/SPARK-17192
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> We need to issue an exception when users specify the partitioning columns 
> without a given schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17192) Issuing an exception when users specify the partitioning columns without a given schema

2016-08-22 Thread Xiao Li (JIRA)
Xiao Li created SPARK-17192:
---

 Summary: Issuing an exception when users specify the partitioning 
columns without a given schema
 Key: SPARK-17192
 URL: https://issues.apache.org/jira/browse/SPARK-17192
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


We need to issue an exception when users specify the partitioning columns 
without a given schema.
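
For illustration, the kind of check being proposed (a sketch only; names are 
illustrative and the real code would most likely raise an AnalysisException from 
inside the data source resolution path):

{code}
import org.apache.spark.sql.types.StructType

// Sketch of the proposed validation. Method and parameter names are
// illustrative, not the actual DataFrameReader/DataSource internals.
def validatePartitioning(userSpecifiedSchema: Option[StructType],
                         partitionColumns: Seq[String]): Unit = {
  if (userSpecifiedSchema.isEmpty && partitionColumns.nonEmpty) {
    throw new IllegalArgumentException(
      "Partitioning columns were specified but no user-specified schema was given")
  }
}
{code}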



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16991) Full outer join followed by inner join produces wrong results

2016-08-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16991:
-
Target Version/s: 2.0.1, 2.1.0

> Full outer join followed by inner join produces wrong results
> -
>
> Key: SPARK-16991
> URL: https://issues.apache.org/jira/browse/SPARK-16991
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Jonas Jarutis
>Priority: Critical
>
> I found strange behaviour using fullouter join in combination with inner 
> join. It seems that inner join can't match values correctly after full outer 
> join. Here is a reproducible example in spark 2.0.
> {code}
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>       /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val a = Seq((1,2),(2,3)).toDF("a","b")
> a: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> val b = Seq((2,5),(3,4)).toDF("a","c")
> b: org.apache.spark.sql.DataFrame = [a: int, c: int]
> scala> val c = Seq((3,1)).toDF("a","d")
> c: org.apache.spark.sql.DataFrame = [a: int, d: int]
> scala> val ab = a.join(b, Seq("a"), "fullouter")
> ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> ab.show
> +---+----+----+
> |  a|   b|   c|
> +---+----+----+
> |  1|   2|null|
> |  3|null|   4|
> |  2|   3|   5|
> +---+----+----+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> +---+---+---+---+
> {code}
> Meanwhile, without the full outer, inner join works fine.
> {code}
> scala> b.join(c, "a").show
> +---+---+---+
> |  a|  c|  d|
> +---+---+---+
> |  3|  4|  1|
> +---+---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17191) Install e1071 R package on Jenkins machines

2016-08-22 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431537#comment-15431537
 ] 

Shivaram Venkataraman commented on SPARK-17191:
---

Thanks Shane !

> Install e1071 R package on Jenkins machines
> ---
>
> Key: SPARK-17191
> URL: https://issues.apache.org/jira/browse/SPARK-17191
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: shane knapp
>Priority: Minor
>
> For running the CRAN checks on Jenkins machines, we need all suggested 
> packages to be installed. This includes the R package called e1071 which is 
> available at
> https://cran.r-project.org/web/packages/e1071/index.html
> I think running something like
> Rscript -e 'install.packages("e1071", repos="http://cran.stat.ucla.edu/")'
> on all the machines should do the trick.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17191) Install e1071 R package on Jenkins machines

2016-08-22 Thread shane knapp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp closed SPARK-17191.
---

> Install e1071 R package on Jenkins machines
> ---
>
> Key: SPARK-17191
> URL: https://issues.apache.org/jira/browse/SPARK-17191
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: shane knapp
>Priority: Minor
>
> For running the CRAN checks on Jenkins machines, we need all suggested 
> packages to be installed. This includes the R package called e1071 which is 
> available at
> https://cran.r-project.org/web/packages/e1071/index.html
> I think running something like
> Rscript -e 'install.packages("e1071", repos="http://cran.stat.ucla.edu/")'
> on all the machines should do the trick.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17191) Install e1071 R package on Jenkins machines

2016-08-22 Thread shane knapp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp resolved SPARK-17191.
-
Resolution: Fixed

> Install e1071 R package on Jenkins machines
> ---
>
> Key: SPARK-17191
> URL: https://issues.apache.org/jira/browse/SPARK-17191
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: shane knapp
>Priority: Minor
>
> For running the CRAN checks on Jenkins machines, we need all suggested 
> packages to be installed. This includes the R package called e1071 which is 
> available at
> https://cran.r-project.org/web/packages/e1071/index.html
> I think running something like
> Rscript -e 'install.packages("e1071", repos="http://cran.stat.ucla.edu/")'
> on all the machines should do the trick.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17191) Install e1071 R package on Jenkins machines

2016-08-22 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-17191:
-

 Summary: Install e1071 R package on Jenkins machines
 Key: SPARK-17191
 URL: https://issues.apache.org/jira/browse/SPARK-17191
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Shivaram Venkataraman
Assignee: shane knapp
Priority: Minor


For running the CRAN checks on Jenkins machines, we need all suggested packages 
to be installed. This includes the R package called e1071 which is available at
https://cran.r-project.org/web/packages/e1071/index.html

I think running something like

Rscript -e 'install.packages("e1071", repos="http://cran.stat.ucla.edu/")'

on all the machines should do the trick.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16962) Unsafe accesses (Platform.getLong()) not supported on unaligned boundaries in SPARC/Solaris

2016-08-22 Thread Suman Somasundar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15413970#comment-15413970
 ] 

Suman Somasundar edited comment on SPARK-16962 at 8/22/16 8:01 PM:
---

I am working on a fix for this issue with other Oracle engineers ([~jlhitt] & 
[~erik.oshaughnessy]), who may also comment and contribute on this.


was (Author: sumansomasundar):
We are working on a fix for this issue. I will submit a patch soon.

> Unsafe accesses (Platform.getLong()) not supported on unaligned boundaries in 
> SPARC/Solaris
> ---
>
> Key: SPARK-16962
> URL: https://issues.apache.org/jira/browse/SPARK-16962
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: SPARC/Solaris
>Reporter: Suman Somasundar
>
> Unaligned accesses are not supported on SPARC architecture. Because of this, 
> Spark applications fail by dumping core on SPARC machines whenever unaligned 
> accesses happen. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16578) Configurable hostname for RBackend

2016-08-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431495#comment-15431495
 ] 

Felix Cheung edited comment on SPARK-16578 at 8/22/16 7:52 PM:
---

+1 on this.
Some discussions and context/data-points on RBackend API or connect to remote 
JVM in SPARK-16581.



was (Author: felixcheung):
+1 on this.
Some discussions on RBackend API or connect to remote JVM in SPARK-16581.


> Configurable hostname for RBackend
> --
>
> Key: SPARK-16578
> URL: https://issues.apache.org/jira/browse/SPARK-16578
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Junyang Qian
>
> One of the requirements that comes up with SparkR being a standalone package 
> is that users can now install just the R package on the client side and 
> connect to a remote machine which runs the RBackend class.
> We should check if we can support this mode of execution and what are the 
> pros / cons of it



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16578) Configurable hostname for RBackend

2016-08-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431495#comment-15431495
 ] 

Felix Cheung commented on SPARK-16578:
--

+1 on this.
Some discussions on RBackend API or connect to remote JVM in SPARK-16581.


> Configurable hostname for RBackend
> --
>
> Key: SPARK-16578
> URL: https://issues.apache.org/jira/browse/SPARK-16578
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Junyang Qian
>
> One of the requirements that comes up with SparkR being a standalone package 
> is that users can now install just the R package on the client side and 
> connect to a remote machine which runs the RBackend class.
> We should check if we can support this mode of execution and what are the 
> pros / cons of it



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public

2016-08-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431492#comment-15431492
 ] 

Felix Cheung commented on SPARK-16581:
--

Certainly - I don't think we should bite off more than we can chew.

re: API - perhaps we should consider an S4 class as a wrapper/context (and 
change the R->JVM functions to S4) - it would make it easier to 
update/evolve/make breaking changes to them.


> Making JVM backend calling functions public
> ---
>
> Key: SPARK-16581
> URL: https://issues.apache.org/jira/browse/SPARK-16581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> As described in the design doc in SPARK-15799, to help packages that need to 
> call into the JVM, it will be good to expose some of the R -> JVM functions 
> we have. 
> As a part of this we could also rename, reformat the functions to make them 
> more user friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17157) Add multiclass logistic regression SparkR Wrapper

2016-08-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431484#comment-15431484
 ] 

Felix Cheung commented on SPARK-17157:
--

Sounds good to have.
Please check the latest changes in the code and add this as a subtask of 
SPARK-16442.

> Add multiclass logistic regression SparkR Wrapper
> -
>
> Key: SPARK-17157
> URL: https://issues.apache.org/jira/browse/SPARK-17157
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Miao Wang
>
> [SPARK-7159][ML] Add multiclass logistic regression to Spark ML  has been 
> merged to Master. I open this JIRA for discussion of adding SparkR wrapper 
> for multiclass logistic regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17173) Refactor R mllib for easier ml implementations

2016-08-22 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-17173.
--
  Resolution: Fixed
   Fix Version/s: 2.1.0
Target Version/s: 2.1.0

> Refactor R mllib for easier ml implementations
> --
>
> Key: SPARK-17173
> URL: https://issues.apache.org/jira/browse/SPARK-17173
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17172) pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.

2016-08-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431409#comment-15431409
 ] 

Andrew Davidson commented on SPARK-17172:
-

Hi Sean

I forgot about that older JIRA issue; I never resolved it. I am using Jupyter, 
and I believe each notebook gets its own Spark context. I googled around and 
found some old issues that seem to suggest that both a Hive and a SQL context 
were being created. I have not figured out how to either use a different 
database for the Hive context or prevent the original Spark context from being 
created.
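
As an aside, the snippet in the description below never needs Hive for the UDF 
itself (it is also missing the StringType import). A minimal sketch of the same 
UDF registered against a plain SQLContext, which avoids the embedded Derby 
metastore entirely, assuming sc is the notebook's existing SparkContext:

{code}
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# A plain SQLContext has no Hive metastore, so there is no Derby lock to fight over.
sqlContext = SQLContext(sc)

def scoreToCategory(score):
    if score >= 80: return 'A'
    elif score >= 60: return 'B'
    elif score >= 35: return 'C'
    else: return 'D'

udfScoreToCategory = udf(scoreToCategory, StringType())
{code}

This only helps if Hive features are not actually needed; otherwise the Derby 
conflict between notebooks still has to be solved at the metastore 
configuration level.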



> pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while 
> calling None.org.apache.spark.sql.hive.HiveContext. 
> --
>
> Key: SPARK-17172
> URL: https://issues.apache.org/jira/browse/SPARK-17172
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.2
> Environment: spark version: 1.6.2
> python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22) 
> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
>Reporter: Andrew Davidson
> Attachments: hiveUDFBug.html, hiveUDFBug.ipynb
>
>
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
> # Define udf
> from pyspark.sql.functions import udf
> def scoreToCategory(score):
> if score >= 80: return 'A'
> elif score >= 60: return 'B'
> elif score >= 35: return 'C'
> else: return 'D'
>  
> udfScoreToCategory=udf(scoreToCategory, StringType())
> throws exception
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.lang.RuntimeException: Unable to 
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17174) Provide support for Timestamp type Column in add_months function to return HH:mm:ss

2016-08-22 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-17174:

Component/s: SQL

> Provide support for Timestamp type Column in add_months function to return 
> HH:mm:ss
> ---
>
> Key: SPARK-17174
> URL: https://issues.apache.org/jira/browse/SPARK-17174
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Amit Baghel
>Priority: Minor
>
> add_months function currently supports Date types. If Column is Timestamp 
> type then it adds month to date but it doesn't return timestamp part 
> (HH:mm:ss). See the code below.
> {code}
> import java.util.Calendar
> val now = Calendar.getInstance().getTime()
> val df = sc.parallelize((0 to 3).map(i => {now.setMonth(i); (i, new 
> java.sql.Timestamp(now.getTime))}).toSeq).toDF("ID", "DateWithTS")
> df.withColumn("NewDateWithTS", add_months(df("DateWithTS"),1)).show
> {code}
> Above code gives following response. See the HH:mm:ss is missing from 
> NewDateWithTS column.
> {code}
> +---+--------------------+-------------+
> | ID|          DateWithTS|NewDateWithTS|
> +---+--------------------+-------------+
> |  0|2016-01-21 09:38:...|   2016-02-21|
> |  1|2016-02-21 09:38:...|   2016-03-21|
> |  2|2016-03-21 09:38:...|   2016-04-21|
> |  3|2016-04-21 09:38:...|   2016-05-21|
> +---+--------------------+-------------+
> {code}
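
A workaround sketch while add_months stays date-only (it reuses the df from the 
snippet above; the approach is illustrative only, and DST shifts around the 
month boundary are not handled):

{code}
from pyspark.sql import functions as F

# Shift the month on the date part, then re-attach the original time-of-day
# as a seconds offset and cast back to timestamp.
df2 = df.withColumn(
    "NewDateWithTS",
    (F.unix_timestamp(F.add_months(F.col("DateWithTS"), 1)) +
     (F.unix_timestamp(F.col("DateWithTS")) -
      F.unix_timestamp(F.col("DateWithTS").cast("date")))).cast("timestamp"))
{code}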



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17172) pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.

2016-08-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431384#comment-15431384
 ] 

Sean Owen commented on SPARK-17172:
---

That seems in order then, though there's an error about it. I think it's 
actually saying this because of the error, which you see farther down. 

Another instance of Derby may have already booted the database

Isn't this the same then as a third JIRA you opened?
https://issues.apache.org/jira/browse/SPARK-15506

> pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while 
> calling None.org.apache.spark.sql.hive.HiveContext. 
> --
>
> Key: SPARK-17172
> URL: https://issues.apache.org/jira/browse/SPARK-17172
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.2
> Environment: spark version: 1.6.2
> python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22) 
> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
>Reporter: Andrew Davidson
> Attachments: hiveUDFBug.html, hiveUDFBug.ipynb
>
>
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
> # Define udf
> from pyspark.sql.functions import udf
> def scoreToCategory(score):
> if score >= 80: return 'A'
> elif score >= 60: return 'B'
> elif score >= 35: return 'C'
> else: return 'D'
>  
> udfScoreToCategory=udf(scoreToCategory, StringType())
> throws exception
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.lang.RuntimeException: Unable to 
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6509) MDLP discretizer

2016-08-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431378#comment-15431378
 ] 

Sean Owen commented on SPARK-6509:
--

The outcome of many "add X to MLlib" proposals, where there isn't clear, 
obvious interest in adding it straight away, is to implement it outside Spark 
and perhaps let it demonstrate from there that it's widely used. This is how 
things like CSV parsing came in. MLlib implementations are so separable that we 
don't really need or even want everything to be part of Spark itself. Some 
things are useful but niche.

> MDLP discretizer
> 
>
> Key: SPARK-6509
> URL: https://issues.apache.org/jira/browse/SPARK-6509
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Sergio Ramírez
>
> Minimum Description Length Discretizer
> This method implements Fayyad's discretizer [1] based on Minimum Description 
> Length Principle (MDLP) in order to treat non discrete datasets from a 
> distributed perspective. We have developed a distributed version from the 
> original one performing some important changes.
> -- Improvements on discretizer:
> Support for sparse data.
> Multi-attribute processing. The whole process is carried out in a single 
> step when the number of boundary points per attribute fits well in one 
> partition (<= 100K boundary points per attribute).
> Support for attributes with a huge number of boundary points (> 100K 
> boundary points per attribute). Rare situation.
> This software has been tested with two large real-world datasets:
> A dataset selected for the GECCO-2014 in Vancouver, July 13th, 2014 
> competition, which comes from the Protein Structure Prediction field 
> (http://cruncher.ncl.ac.uk/bdcomp/). The dataset has 32 million instances, 
> 631 attributes, 2 classes, 98% of negative examples and occupies, when 
> uncompressed, about 56GB of disk space.
> Epsilon dataset: 
> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#epsilon. 
> 400K instances and 2K attributes
> We have demonstrated that our method performs 300 times faster than the 
> sequential version for the first dataset, and also improves the accuracy for 
> Naive Bayes.
> Publication: S. Ramírez-Gallego, S. García, H. Mouriño-Talin, D. 
> Martínez-Rego, V. Bolón, A. Alonso-Betanzos, J.M. Benitez, F. Herrera. "Data 
> Discretization: Taxonomy and Big Data Challenge", WIRES Data Mining and 
> Knowledge Discovery. In press, 2015.
> Design doc: 
> https://docs.google.com/document/d/1HOaPL_HJzTbL2tVdzbTjhr5wxVvPe9e-23S7rc2VcsY/edit?usp=sharing
> References
> [1] Fayyad, U., & Irani, K. (1993).
> "Multi-interval discretization of continuous-valued attributes for 
> classification learning."
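
For reference, a minimal single-cut sketch of the Fayyad-Irani MDLP acceptance 
test the discretizer is built around (paraphrased from reference [1] above, not 
taken from the linked design doc; the function and variable names are mine):

{code}
import math

def entropy(class_counts):
    """Shannon entropy of a class distribution given as a list of counts."""
    n = sum(class_counts)
    return -sum((c / n) * math.log2(c / n) for c in class_counts if c > 0)

def mdlp_accepts_cut(counts_s, counts_s1, counts_s2):
    """True if the cut splitting S into S1/S2 passes the MDLP criterion."""
    n, n1, n2 = sum(counts_s), sum(counts_s1), sum(counts_s2)
    k = sum(1 for c in counts_s if c > 0)    # classes present in S
    k1 = sum(1 for c in counts_s1 if c > 0)  # classes present in S1
    k2 = sum(1 for c in counts_s2 if c > 0)  # classes present in S2
    ent_s, ent_s1, ent_s2 = entropy(counts_s), entropy(counts_s1), entropy(counts_s2)
    gain = ent_s - (n1 / n) * ent_s1 - (n2 / n) * ent_s2
    delta = math.log2(3 ** k - 2) - (k * ent_s - k1 * ent_s1 - k2 * ent_s2)
    return gain > (math.log2(n - 1) + delta) / n
{code}

As the description notes, the distributed part of the proposal is mainly about 
evaluating candidate cuts over the boundary points of many attributes in 
parallel.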



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17172) pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.

2016-08-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431371#comment-15431371
 ] 

Andrew Davidson commented on SPARK-17172:
-

Hi Sean

the cluster was created using spark-ec2 from spark-1.6.1-bin-hadoop2.6

[ec2-user@ip-172-31-22-140 root]$ cat /root/spark/RELEASE 
Spark 1.6.1 built for Hadoop 2.0.0-mr1-cdh4.2.0
Build flags: -Psparkr -Phadoop-1 -Phive -Phive-thriftserver 
-Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DzincPort=3032
[ec2-user@ip-172-31-22-140 root]$ 

> pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while 
> calling None.org.apache.spark.sql.hive.HiveContext. 
> --
>
> Key: SPARK-17172
> URL: https://issues.apache.org/jira/browse/SPARK-17172
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.2
> Environment: spark version: 1.6.2
> python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22) 
> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
>Reporter: Andrew Davidson
> Attachments: hiveUDFBug.html, hiveUDFBug.ipynb
>
>
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
> # Define udf
> from pyspark.sql.functions import udf
> def scoreToCategory(score):
> if score >= 80: return 'A'
> elif score >= 60: return 'B'
> elif score >= 35: return 'C'
> else: return 'D'
>  
> udfScoreToCategory=udf(scoreToCategory, StringType())
> throws exception
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.lang.RuntimeException: Unable to 
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16495) Add ADMM optimizer in mllib package

2016-08-22 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431362#comment-15431362
 ] 

DB Tsai commented on SPARK-16495:
-

This is related to https://issues.apache.org/jira/browse/SPARK-17136. Once we 
have an optimizer interface in Spark ML, we can have an implementation of an 
ADMM optimizer in Spark ML.
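
For readers unfamiliar with ADMM itself, a minimal single-machine numpy sketch 
of the method applied to the lasso, following the update rules in the Boyd 
paper linked in the description below (not Spark code, and not tied to any 
particular optimizer interface):

{code}
import numpy as np

def soft_threshold(a, kappa):
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

def admm_lasso(A, b, lam, rho=1.0, n_iter=200):
    """Minimize 0.5*||Ax - b||^2 + lam*||x||_1 with scaled-form ADMM."""
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    # Factor (A^T A + rho*I) once; it is reused by every x-update.
    L = np.linalg.cholesky(A.T @ A + rho * np.eye(n))
    Atb = A.T @ b
    for _ in range(n_iter):
        x = np.linalg.solve(L.T, np.linalg.solve(L, Atb + rho * (z - u)))  # x-update
        z = soft_threshold(x + u, lam / rho)                               # z-update
        u = u + x - z                                                      # dual update
    return z
{code}

In the consensus/distributed variant, the x-update is solved per partition and 
the z- and u-updates act as the coordination step, which is why the method maps 
well onto Spark.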

> Add ADMM optimizer in mllib package
> ---
>
> Key: SPARK-16495
> URL: https://issues.apache.org/jira/browse/SPARK-16495
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: zunwen you
>
>  Alternating Direction Method of Multipliers (ADMM) is well suited to 
> distributed convex optimization, and in particular to large-scale problems 
> arising in statistics, machine learning, and related areas.
> Details can be found in the [S. Boyd's 
> paper](http://www.stanford.edu/~boyd/papers/admm_distr_stats.html).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Document G1 heap region's effect on spark 2.0 vs 1.6

2016-08-22 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431351#comment-15431351
 ] 

Yin Huai commented on SPARK-16320:
--

After investigation, this perf issue was caused by a GC setting 
(https://issues.apache.org/jira/browse/SPARK-16320?focusedCommentId=15421699=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15421699).
[~maver1ck], thank you for the investigation. [~srowen], thank you for sending 
out the patch to update the doc.
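
For anyone arriving here from a search, the setting in question is the G1 
region size passed through the executor JVM options. A sketch of how it is set 
(the 32m value is only an example, not a recommendation from this ticket):

{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.extraJavaOptions",
             "-XX:+UseG1GC -XX:G1HeapRegionSize=32m"))
# Driver-side JVM flags must be set before the driver JVM starts, e.g. via
# spark-submit --driver-java-options, not through SparkConf in client mode.
sc = SparkContext(conf=conf)
{code}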

> Document G1 heap region's effect on spark 2.0 vs 1.6
> 
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions), and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 points to the 
> same source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16320) Document G1 heap region's effect on spark 2.0 vs 1.6

2016-08-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-16320.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14732
[https://github.com/apache/spark/pull/14732]

> Document G1 heap region's effect on spark 2.0 vs 1.6
> 
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
> Attachments: spark1.6-ui.png, spark2-ui.png
>
>
> I did some tests on a parquet file with many nested columns (about 30G in
> 400 partitions), and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> *UPDATE*
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import struct
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop
> *UPDATE 2*
> Analysis in https://issues.apache.org/jira/browse/SPARK-16321 points to the 
> same source.
> I attached some VisualVM profiles there.
> Most interesting are from queries.
> https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps
> https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16577) Add check-cran script to Jenkins

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16577:


Assignee: Shivaram Venkataraman  (was: Apache Spark)

> Add check-cran script to Jenkins
> 
>
> Key: SPARK-16577
> URL: https://issues.apache.org/jira/browse/SPARK-16577
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>
> After we have fixed the warnings from the CRAN checks we should add this as a 
> part of the Jenkins build.
> This depends on SPARK-16507 and SPARK-16508 being resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16577) Add check-cran script to Jenkins

2016-08-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431344#comment-15431344
 ] 

Apache Spark commented on SPARK-16577:
--

User 'shivaram' has created a pull request for this issue:
https://github.com/apache/spark/pull/14759

> Add check-cran script to Jenkins
> 
>
> Key: SPARK-16577
> URL: https://issues.apache.org/jira/browse/SPARK-16577
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>
> After we have fixed the warnings from the CRAN checks we should add this as a 
> part of the Jenkins build.
> This depends on SPARK-16507 and SPARK-16508 being resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16577) Add check-cran script to Jenkins

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16577:


Assignee: Apache Spark  (was: Shivaram Venkataraman)

> Add check-cran script to Jenkins
> 
>
> Key: SPARK-16577
> URL: https://issues.apache.org/jira/browse/SPARK-16577
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Apache Spark
>
> After we have fixed the warnings from the CRAN checks we should add this as a 
> part of the Jenkins build.
> This depends on SPARK-16507 and SPARK-16508 being resolved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6509) MDLP discretizer

2016-08-22 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431341#comment-15431341
 ] 

Barry Becker commented on SPARK-6509:
-

I may have missed the reasoning somewhere, but why was this marked wontfix? It 
seems like it would be a good addition.

> MDLP discretizer
> 
>
> Key: SPARK-6509
> URL: https://issues.apache.org/jira/browse/SPARK-6509
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Sergio Ramírez
>
> Minimum Description Length Discretizer
> This method implements Fayyad's discretizer [1] based on Minimum Description 
> Length Principle (MDLP) in order to treat non discrete datasets from a 
> distributed perspective. We have developed a distributed version from the 
> original one performing some important changes.
> -- Improvements on discretizer:
> Support for sparse data.
> Multi-attribute processing. The whole process is carried out in a single 
> step when the number of boundary points per attribute fits well in one 
> partition (<= 100K boundary points per attribute).
> Support for attributes with a huge number of boundary points (> 100K 
> boundary points per attribute). Rare situation.
> This software has been tested with two large real-world datasets:
> A dataset selected for the GECCO-2014 in Vancouver, July 13th, 2014 
> competition, which comes from the Protein Structure Prediction field 
> (http://cruncher.ncl.ac.uk/bdcomp/). The dataset has 32 million instances, 
> 631 attributes, 2 classes, 98% of negative examples and occupies, when 
> uncompressed, about 56GB of disk space.
> Epsilon dataset: 
> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#epsilon. 
> 400K instances and 2K attributes
> We have demonstrated that our method performs 300 times faster than the 
> sequential version for the first dataset, and also improves the accuracy for 
> Naive Bayes.
> Publication: S. Ramírez-Gallego, S. García, H. Mouriño-Talin, D. 
> Martínez-Rego, V. Bolón, A. Alonso-Betanzos, J.M. Benitez, F. Herrera. "Data 
> Discretization: Taxonomy and Big Data Challenge", WIRES Data Mining and 
> Knowledge Discovery. In press, 2015.
> Design doc: 
> https://docs.google.com/document/d/1HOaPL_HJzTbL2tVdzbTjhr5wxVvPe9e-23S7rc2VcsY/edit?usp=sharing
> References
> [1] Fayyad, U., & Irani, K. (1993).
> "Multi-interval discretization of continuous-valued attributes for 
> classification learning."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)

2016-08-22 Thread Alexander Tronchin-James (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431322#comment-15431322
 ] 

Alexander Tronchin-James commented on SPARK-12394:
--

Where can we read about/contribute to efforts for implementing the filter on 
sorted data and optimized sort merge join strategies mentioned in the attached 
BucketedTables.pdf? Looking forward to these features!
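
For context, a sketch of what a bucketed write looks like from the 
DataFrameWriter side; the table and column names are made up, and bucketBy / 
sortBy landed in the Scala/Java writer first and only reached the Python writer 
in a later release, so this assumes a version where they exist in PySpark:

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["user_id", "event"])

# Pre-shuffle into 8 buckets by user_id so that later joins/aggregations on
# user_id can avoid another exchange.
(df.write
   .bucketBy(8, "user_id")
   .sortBy("user_id")
   .mode("overwrite")
   .saveAsTable("events_bucketed"))
{code}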

> Support writing out pre-hash-partitioned data and exploit that in join 
> optimizations to avoid shuffle (i.e. bucketing in Hive)
> --
>
> Key: SPARK-12394
> URL: https://issues.apache.org/jira/browse/SPARK-12394
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Nong Li
> Fix For: 2.0.0
>
> Attachments: BucketedTables.pdf
>
>
> In many cases users know ahead of time the columns that they will be joining 
> or aggregating on.  Ideally they should be able to leverage this information 
> and pre-shuffle the data so that subsequent queries do not require a shuffle. 
>  Hive supports this functionality by allowing the user to define buckets, 
> which are hash partitioning of the data based on some key.
>  - Allow the user to specify a set of columns when caching or writing out data
>  - Allow the user to specify some parallelism
>  - Shuffle the data when writing / caching such that its distributed by these 
> columns
>  - When planning/executing  a query, use this distribution to avoid another 
> shuffle when reading, assuming the join or aggregation is compatible with the 
> columns specified
>  - Should work with existing save modes: append, overwrite, etc
>  - Should work at least with all Hadoops FS data sources
>  - Should work with any data source when caching



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13030) Change OneHotEncoder to Estimator

2016-08-22 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431319#comment-15431319
 ] 

Nick Pentreath edited comment on SPARK-13030 at 8/22/16 6:08 PM:
-

Yes I also agree OHE needs to be an {{Estimator}} in order to actually be 
usable in a pipeline. Alternative is to have a "stateful" transformer - but IMO 
estimator makes more sense here. 

The issue we face is that OHE in 2.0 is locked down and we can't break things 
now - since it's no longer {{Experimental}}.

Though in many senses this can be viewed as a bug?!


was (Author: mlnick):
Yes I also agree OHE needs to be an {{Estimator}} in order to actually be 
usable in a pipeline. Alternative is to have a "stateful" transformer - but IMO 
estimator makes more sense here. 

The issue we face is that OHE in 2.0 is locked down and we can't break things 
now - since it's no longer {{Experimental}}.

> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when number of categories is 
> different between training dataset and test dataset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2016-08-22 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431319#comment-15431319
 ] 

Nick Pentreath commented on SPARK-13030:


Yes I also agree OHE needs to be an {{Estimator}} in order to actually be 
usable in a pipeline. Alternative is to have a "stateful" transformer - but IMO 
estimator makes more sense here. 

The issue we face is that OHE in 2.0 is locked down and we can't break things 
now - since it's no longer {{Experimental}}.

> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when number of categories is 
> different between training dataset and test dataset.
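
A tiny sketch of the failure mode described above (Spark 2.0 Python API, 
made-up data): because the encoder is a plain Transformer, the output vector 
width is inferred from whatever data it happens to transform, so train and test 
can disagree.

{code}
from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoder

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame([(0.0,), (1.0,), (2.0,)], ["categoryIndex"])
test  = spark.createDataFrame([(0.0,), (1.0,)], ["categoryIndex"])

encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec",
                        dropLast=False)

# Vector size is inferred per dataset: 3 on train, but only 2 on test.
print(encoder.transform(train).first().categoryVec.size)
print(encoder.transform(test).first().categoryVec.size)
{code}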



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11328) Provide more informative error message when direct parquet output committer is used and there is a file already exists error.

2016-08-22 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431275#comment-15431275
 ] 

Yin Huai commented on SPARK-11328:
--

Actually, the JIRA is meant to provide more information in the error message 
(see 
https://github.com/apache/spark/pull/10080/files#diff-244a70a91841bbddf6fae17c14c18ce4R137).
 This jira does not aim to get rid of the file already exists error.

> Provide more informative error message when direct parquet output committer 
> is used and there is a file already exists error.
> -
>
> Key: SPARK-11328
> URL: https://issues.apache.org/jira/browse/SPARK-11328
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Nong Li
>Priority: Critical
> Fix For: 1.5.3, 1.6.0
>
>
> When saving data to S3 (e.g. saving to parquet), if there is an error during 
> the query execution, the partial file generated by the failed task will be 
> uploaded to S3 and the retries of this task will throw file already exist 
> error. It is very confusing to users because they may think that file already 
> exist error is the error causing the job failure. They can only find the real 
> error in the spark ui (in the stage page).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17190) Removal of HiveSharedState

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17190:


Assignee: Apache Spark

> Removal of HiveSharedState
> --
>
> Key: SPARK-17190
> URL: https://issues.apache.org/jira/browse/SPARK-17190
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Since `HiveClient` is used to interact with the Hive metastore, it should be 
> hidden in `HiveExternalCatalog`. After moving `HiveClient` into 
> `HiveExternalCatalog`, `HiveSharedState` becomes a wrapper of 
> `HiveExternalCatalog`. Thus, removal of `HiveSharedState` becomes 
> straightforward. After removal of `HiveSharedState`, the reflection logic is 
> directly applied on the choice of `ExternalCatalog` types, based on the 
> configuration of `CATALOG_IMPLEMENTATION`. 
> Since `HiveClient` is also used/invoked by other entities besides 
> HiveExternalCatalog, we define the following two APIs:
> {noformat}
>   /**
>* Return the existing [[HiveClient]] used to interact with the metastore.
>*/
>   def getClient: HiveClient
>   /**
>* Return a [[HiveClient]] as a new session
>*/
>   def getNewClient: HiveClient
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17190) Removal of HiveSharedState

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17190:


Assignee: (was: Apache Spark)

> Removal of HiveSharedState
> --
>
> Key: SPARK-17190
> URL: https://issues.apache.org/jira/browse/SPARK-17190
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Since `HiveClient` is used to interact with the Hive metastore, it should be 
> hidden in `HiveExternalCatalog`. After moving `HiveClient` into 
> `HiveExternalCatalog`, `HiveSharedState` becomes a wrapper of 
> `HiveExternalCatalog`. Thus, removal of `HiveSharedState` becomes 
> straightforward. After removal of `HiveSharedState`, the reflection logic is 
> directly applied on the choice of `ExternalCatalog` types, based on the 
> configuration of `CATALOG_IMPLEMENTATION`. 
> Since `HiveClient` is also used/invoked by other entities besides 
> HiveExternalCatalog, we define the following two APIs:
> {noformat}
>   /**
>* Return the existing [[HiveClient]] used to interact with the metastore.
>*/
>   def getClient: HiveClient
>   /**
>* Return a [[HiveClient]] as a new session
>*/
>   def getNewClient: HiveClient
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17190) Removal of HiveSharedState

2016-08-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431251#comment-15431251
 ] 

Apache Spark commented on SPARK-17190:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14757

> Removal of HiveSharedState
> --
>
> Key: SPARK-17190
> URL: https://issues.apache.org/jira/browse/SPARK-17190
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Since `HiveClient` is used to interact with the Hive metastore, it should be 
> hidden in `HiveExternalCatalog`. After moving `HiveClient` into 
> `HiveExternalCatalog`, `HiveSharedState` becomes a wrapper of 
> `HiveExternalCatalog`. Thus, removal of `HiveSharedState` becomes 
> straightforward. After removal of `HiveSharedState`, the reflection logic is 
> directly applied on the choice of `ExternalCatalog` types, based on the 
> configuration of `CATALOG_IMPLEMENTATION`. 
> Since `HiveClient` is also used/invoked by other entities besides 
> HiveExternalCatalog, we define the following two APIs:
> {noformat}
>   /**
>* Return the existing [[HiveClient]] used to interact with the metastore.
>*/
>   def getClient: HiveClient
>   /**
>* Return a [[HiveClient]] as a new session
>*/
>   def getNewClient: HiveClient
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17190) Removal of HiveSharedState

2016-08-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17190:

Issue Type: Improvement  (was: Bug)

> Removal of HiveSharedState
> --
>
> Key: SPARK-17190
> URL: https://issues.apache.org/jira/browse/SPARK-17190
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Since `HiveClient` is used to interact with the Hive metastore, it should be 
> hidden in `HiveExternalCatalog`. After moving `HiveClient` into 
> `HiveExternalCatalog`, `HiveSharedState` becomes a wrapper of 
> `HiveExternalCatalog`. Thus, removal of `HiveSharedState` becomes 
> straightforward. After removal of `HiveSharedState`, the reflection logic is 
> directly applied on the choice of `ExternalCatalog` types, based on the 
> configuration of `CATALOG_IMPLEMENTATION`. 
> Since `HiveClient` is also used/invoked by other entities besides 
> HiveExternalCatalog, we define the following two APIs:
> {noformat}
>   /**
>* Return the existing [[HiveClient]] used to interact with the metastore.
>*/
>   def getClient: HiveClient
>   /**
>* Return a [[HiveClient]] as a new session
>*/
>   def getNewClient: HiveClient
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17190) Removal of HiveSharedState

2016-08-22 Thread Xiao Li (JIRA)
Xiao Li created SPARK-17190:
---

 Summary: Removal of HiveSharedState
 Key: SPARK-17190
 URL: https://issues.apache.org/jira/browse/SPARK-17190
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


Since `HiveClient` is used to interact with the Hive metastore, it should be 
hidden in `HiveExternalCatalog`. After moving `HiveClient` into 
`HiveExternalCatalog`, `HiveSharedState` becomes a wrapper of 
`HiveExternalCatalog`. Thus, removal of `HiveSharedState` becomes 
straightforward. After removal of `HiveSharedState`, the reflection logic is 
directly applied on the choice of `ExternalCatalog` types, based on the 
configuration of `CATALOG_IMPLEMENTATION`. 

Since `HiveClient` is also used/invoked by other entities besides 
HiveExternalCatalog, we define the following two APIs:
{noformat}
  /**
   * Return the existing [[HiveClient]] used to interact with the metastore.
   */
  def getClient: HiveClient

  /**
   * Return a [[HiveClient]] as a new session
   */
  def getNewClient: HiveClient
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17189) [MINOR] Looses the interface from UnsafeRow to InternalRow in AggregationIterator if UnsafeRow specific method is not used

2016-08-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431239#comment-15431239
 ] 

Apache Spark commented on SPARK-17189:
--

User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/14756

> [MINOR] Looses the interface from UnsafeRow to InternalRow in 
> AggregationIterator if UnsafeRow specific method is not used
> --
>
> Key: SPARK-17189
> URL: https://issues.apache.org/jira/browse/SPARK-17189
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17189) [MINOR] Looses the interface from UnsafeRow to InternalRow in AggregationIterator if UnsafeRow specific method is not used

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17189:


Assignee: (was: Apache Spark)

> [MINOR] Looses the interface from UnsafeRow to InternalRow in 
> AggregationIterator if UnsafeRow specific method is not used
> --
>
> Key: SPARK-17189
> URL: https://issues.apache.org/jira/browse/SPARK-17189
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17189) [MINOR] Looses the interface from UnsafeRow to InternalRow in AggregationIterator if UnsafeRow specific method is not used

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17189:


Assignee: Apache Spark

> [MINOR] Looses the interface from UnsafeRow to InternalRow in 
> AggregationIterator if UnsafeRow specific method is not used
> --
>
> Key: SPARK-17189
> URL: https://issues.apache.org/jira/browse/SPARK-17189
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17189) [MINOR] Looses the interface from UnsafeRow to InternalRow in AggregationIterator if UnsafeRow specific method is not used

2016-08-22 Thread Sean Zhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Zhong updated SPARK-17189:
---
Component/s: SQL

> [MINOR] Looses the interface from UnsafeRow to InternalRow in 
> AggregationIterator if UnsafeRow specific method is not used
> --
>
> Key: SPARK-17189
> URL: https://issues.apache.org/jira/browse/SPARK-17189
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17189) [MINOR] Looses the interface from UnsafeRow to InternalRow in AggregationIterator if UnsafeRow specific method is not used

2016-08-22 Thread Sean Zhong (JIRA)
Sean Zhong created SPARK-17189:
--

 Summary: [MINOR] Looses the interface from UnsafeRow to 
InternalRow in AggregationIterator if UnsafeRow specific method is not used
 Key: SPARK-17189
 URL: https://issues.apache.org/jira/browse/SPARK-17189
 Project: Spark
  Issue Type: Bug
Reporter: Sean Zhong
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17188:


Assignee: Apache Spark

> Moves QuantileSummaries to project catalyst from sql so that it can be used 
> to implement percentile_approx
> --
>
> Key: SPARK-17188
> URL: https://issues.apache.org/jira/browse/SPARK-17188
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Apache Spark
>
> QuantileSummaries is a useful utility class to do statistics. It can be used 
> by aggregation function like percentile_approx.
> Currently, QuantileSummaries is located in project sql, in package 
> org.apache.spark.sql.execution.stat; we should probably move it to project 
> catalyst, package org.apache.spark.sql.util.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx

2016-08-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431179#comment-15431179
 ] 

Apache Spark commented on SPARK-17188:
--

User 'clockfly' has created a pull request for this issue:
https://github.com/apache/spark/pull/14754

> Moves QuantileSummaries to project catalyst from sql so that it can be used 
> to implement percentile_approx
> --
>
> Key: SPARK-17188
> URL: https://issues.apache.org/jira/browse/SPARK-17188
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sean Zhong
>
> QuantileSummaries is a useful utility class to do statistics. It can be used 
> by aggregation function like percentile_approx.
> Currently, QuantileSummaries is located in project sql, in package 
> org.apache.spark.sql.execution.stat; we should probably move it to project 
> catalyst, package org.apache.spark.sql.util.
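
For context, QuantileSummaries is the machinery behind DataFrame.approxQuantile, 
so the refactoring is about where that class lives rather than the user-facing 
API. A small usage sketch with made-up data:

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(float(i),) for i in range(1000)], ["value"])

# approxQuantile is backed by QuantileSummaries; the last argument is the
# relative error of the approximation.
print(df.approxQuantile("value", [0.25, 0.5, 0.75], 0.01))
{code}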



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17188) Moves QuantileSummaries to project catalyst from sql so that it can be used to implement percentile_approx

2016-08-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17188:


Assignee: (was: Apache Spark)

> Moves QuantileSummaries to project catalyst from sql so that it can be used 
> to implement percentile_approx
> --
>
> Key: SPARK-17188
> URL: https://issues.apache.org/jira/browse/SPARK-17188
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sean Zhong
>
> QuantileSummaries is a useful utility class to do statistics. It can be used 
> by aggregation function like percentile_approx.
> Currently, QuantileSummaries is located in project sql, in package 
> org.apache.spark.sql.execution.stat; we should probably move it to project 
> catalyst, package org.apache.spark.sql.util.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


