[jira] [Commented] (SPARK-16456) Reuse the uncorrelated scalar subqueries with the same logical plan in a query

2016-07-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368926#comment-15368926
 ] 

Apache Spark commented on SPARK-16456:
--

User 'lianhuiwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/14111

> Reuse the uncorrelated scalar subqueries with the same logical plan in a query
> --
>
> Key: SPARK-16456
> URL: https://issues.apache.org/jira/browse/SPARK-16456
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> In TPCDS Q14, the same physical plan for an uncorrelated scalar subquery from a 
> CTE can be executed multiple times; we should reuse the result to avoid 
> duplicated computation.
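
For illustration, a minimal sketch of the pattern described above (the table and
column names are made up): the same uncorrelated scalar subquery appears twice in
one query, so without reuse its plan would be executed twice.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ScalarSubqueryReuse").getOrCreate()

// Hypothetical "sales" table; the scalar subquery over avg_sales appears twice.
spark.sql("""
  WITH avg_sales AS (SELECT AVG(amount) AS avg_amount FROM sales)
  SELECT *
  FROM sales
  WHERE amount > (SELECT avg_amount FROM avg_sales)
     OR amount > 2 * (SELECT avg_amount FROM avg_sales)
""").explain()  // without reuse, each scalar subquery shows up as its own subplan
{code}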



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16456) Reuse the uncorrelated scalar subqueries with the same logical plan in a query

2016-07-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16456:


Assignee: Apache Spark

> Reuse the uncorrelated scalar subqueries with the same logical plan in a query
> --
>
> Key: SPARK-16456
> URL: https://issues.apache.org/jira/browse/SPARK-16456
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>Assignee: Apache Spark
>
> In TPCDS Q14, the same physical plan for an uncorrelated scalar subquery from a 
> CTE can be executed multiple times; we should reuse the result to avoid 
> duplicated computation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16456) Reuse the uncorrelated scalar subqueries with the same logical plan in a query

2016-07-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16456:


Assignee: (was: Apache Spark)

> Reuse the uncorrelated scalar subqueries with the same logical plan in a query
> --
>
> Key: SPARK-16456
> URL: https://issues.apache.org/jira/browse/SPARK-16456
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> In TPCDS Q14, the same physical plan for an uncorrelated scalar subquery from a 
> CTE can be executed multiple times; we should reuse the result to avoid 
> duplicated computation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16446) Gaussian Mixture Model wrapper in SparkR

2016-07-08 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-16446:
---

Assignee: Yanbo Liang

> Gaussian Mixture Model wrapper in SparkR
> 
>
> Key: SPARK-16446
> URL: https://issues.apache.org/jira/browse/SPARK-16446
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> Follow instructions in SPARK-16442 and implement Gaussian Mixture Model 
> wrapper in SparkR.
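
For reference, a minimal sketch of the underlying spark.ml API such a SparkR
wrapper would delegate to (the toy data below is made up for illustration):

{code}
import org.apache.spark.ml.clustering.GaussianMixture
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("GmmSketch").getOrCreate()
import spark.implicits._

// Two well-separated clusters of 2-dimensional points.
val data = Seq(
  Vectors.dense(0.1, 0.2), Vectors.dense(0.2, 0.1),
  Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 8.9)
).map(Tuple1.apply).toDF("features")

val model = new GaussianMixture().setK(2).setSeed(42L).fit(data)
model.weights.foreach(println)  // mixing weights of the two fitted components
{code}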



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16446) Gaussian Mixture Model wrapper in SparkR

2016-07-08 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368916#comment-15368916
 ] 

Yanbo Liang commented on SPARK-16446:
-

Sure.

> Gaussian Mixture Model wrapper in SparkR
> 
>
> Key: SPARK-16446
> URL: https://issues.apache.org/jira/browse/SPARK-16446
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, SparkR
>Reporter: Xiangrui Meng
>
> Follow instructions in SPARK-16442 and implement Gaussian Mixture Model 
> wrapper in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16456) Reuse the uncorrelated scalar subqueries with the same logical plan in a query

2016-07-08 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-16456:
-
Summary: Reuse the uncorrelated scalar subqueries with the same logical 
plan in a query  (was: Reuse the uncorrelated scalar subqueries with with the 
same logical plan in a query)

> Reuse the uncorrelated scalar subqueries with the same logical plan in a query
> --
>
> Key: SPARK-16456
> URL: https://issues.apache.org/jira/browse/SPARK-16456
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> In TPCDS Q14, the same physical plan for an uncorrelated scalar subquery from a 
> CTE can be executed multiple times; we should reuse the result to avoid 
> duplicated computation on the same uncorrelated scalar subquery.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16456) Reuse the uncorrelated scalar subqueries with the same logical plan in a query

2016-07-08 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-16456:
-
Description: In TPCDS Q14, the same physical plan for an uncorrelated scalar 
subquery from a CTE can be executed multiple times; we should reuse the result 
to avoid duplicated computation.  (was: In TPCDS Q14, the same physical plan 
for an uncorrelated scalar subquery from a CTE can be executed multiple times; 
we should reuse the result to avoid duplicated computation on the same 
uncorrelated scalar subquery.)

> Reuse the uncorrelated scalar subqueries with the same logical plan in a query
> --
>
> Key: SPARK-16456
> URL: https://issues.apache.org/jira/browse/SPARK-16456
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>
> In TPCDS Q14, the same physical plan for an uncorrelated scalar subquery from a 
> CTE can be executed multiple times; we should reuse the result to avoid 
> duplicated computation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11857) Remove Mesos fine-grained mode subject to discussions

2016-07-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-11857.
-
   Resolution: Fixed
 Assignee: Michael Gummelt  (was: Reynold Xin)
Fix Version/s: 2.0.0

> Remove Mesos fine-grained mode subject to discussions
> -
>
> Key: SPARK-11857
> URL: https://issues.apache.org/jira/browse/SPARK-11857
> Project: Spark
>  Issue Type: Sub-task
>  Components: Mesos
>Reporter: Reynold Xin
>Assignee: Michael Gummelt
> Fix For: 2.0.0
>
>
> See discussions in
> http://apache-spark-developers-list.1001551.n3.nabble.com/Removing-the-Mesos-fine-grained-mode-td15277.html
> and
> http://apache-spark-developers-list.1001551.n3.nabble.com/Please-reply-if-you-use-Mesos-fine-grained-mode-td14930.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16432) Empty blocks fail to serialize due to assert in ChunkedByteBuffer

2016-07-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16432.
-
   Resolution: Fixed
 Assignee: Eric Liang
Fix Version/s: 2.0.0

> Empty blocks fail to serialize due to assert in ChunkedByteBuffer
> -
>
> Key: SPARK-16432
> URL: https://issues.apache.org/jira/browse/SPARK-16432
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.0.0
>
>
> See https://github.com/apache/spark/pull/11748#issuecomment-230760283



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16376) [Spark web UI]:HTTP ERROR 500 when using rest api "/applications/[app-id]/jobs" if array "stageIds" is empty

2016-07-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16376.
-
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 2.0.0

> [Spark web UI]:HTTP ERROR 500 when using rest api 
> "/applications/[app-id]/jobs" if array "stageIds" is empty
> 
>
> Key: SPARK-16376
> URL: https://issues.apache.org/jira/browse/SPARK-16376
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: marymwu
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.0.0
>
>
> [Spark web UI]:HTTP ERROR 500 when using rest api 
> "/applications/[app-id]/jobs" if array "stageIds" is empty
> See attachment for reference.
> HTTP ERROR 500
> Problem accessing /api/v1/applications/application_1466239933301_175531/jobs. 
> Reason:
> Server Error
> Caused by:
> java.lang.UnsupportedOperationException: empty.max
>   at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:216)
>   at scala.collection.AbstractTraversable.max(Traversable.scala:105)
>   at 
> org.apache.spark.status.api.v1.AllJobsResource$.convertJobData(AllJobsResource.scala:71)
>   at 
> org.apache.spark.status.api.v1.AllJobsResource$$anonfun$2$$anonfun$apply$2.apply(AllJobsResource.scala:46)
>   at 
> org.apache.spark.status.api.v1.AllJobsResource$$anonfun$2$$anonfun$apply$2.apply(AllJobsResource.scala:44)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
>   at 
> org.apache.spark.status.api.v1.AllJobsResource$$anonfun$2.apply(AllJobsResource.scala:44)
>   at 
> org.apache.spark.status.api.v1.AllJobsResource$$anonfun$2.apply(AllJobsResource.scala:43)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$flatMap$2.apply(TraversableLike.scala:753)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$WithFilter.flatMap(TraversableLike.scala:752)
>   at 
> org.apache.spark.status.api.v1.AllJobsResource.jobsList(AllJobsResource.scala:43)
>   
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>   at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
>   at 
> com.sun.jersey.server.impl.uri.rules.SubLocatorRule.accept(SubLocatorRule.java:134)
>   at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>   at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>   at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>   at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
>   at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
>   at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
>   at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
>   at 
> org.spark-project.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
>   at 
> org.spark-project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1496)
>   at 
> 
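
The root cause is visible in the first trace frame: calling .max on an empty
Scala collection throws exactly this exception. A minimal sketch of the failure
mode and a defensive alternative (not necessarily the code of the actual fix):

{code}
val stageIds = Seq.empty[Int]

// stageIds.max   // throws java.lang.UnsupportedOperationException: empty.max
val latestStage: Option[Int] = stageIds.reduceOption(_ max _)  // None for a job with no stages
{code}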

[jira] [Resolved] (SPARK-13569) Kafka DStreams from wildcard topic filters

2016-07-08 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-13569.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 14026
[https://github.com/apache/spark/pull/14026]

> Kafka DStreams from wildcard topic filters
> --
>
> Key: SPARK-13569
> URL: https://issues.apache.org/jira/browse/SPARK-13569
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Miguel Miranda
> Fix For: 2.0.0
>
>
> Expose Kafka's ConsumerConnector createMessageStreamsByFilter functionality 
> so that wildcards (whitelists and blacklists) can be used.
> This impacts KafkaUtils' createStream interface so that something other than a 
> list of maps can be passed as an argument.
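
As a point of reference (an assumption on my part, not necessarily what the
linked pull request implemented), pattern-based topic subscription exists in the
separate spark-streaming-kafka-0-10 integration; a minimal sketch, assuming a
local broker at localhost:9092:

{code}
import java.util.regex.Pattern

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val conf = new SparkConf().setAppName("WildcardTopics").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer"   -> "org.apache.kafka.common.serialization.StringDeserializer",
  "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "group.id"           -> "wildcard-example")

// Subscribe to every topic matching the regex, e.g. logs.app1, logs.app2, ...
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("logs\\..*"), kafkaParams))

stream.map(record => (record.key, record.value)).print()
ssc.start()
{code}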



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16455) Add a new hook in CoarseGrainedSchedulerBackend in order to stop scheduling new tasks when cluster is restarting

2016-07-08 Thread YangyangLiu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YangyangLiu updated SPARK-16455:

Description: In our case, we are implementing a new mechanism that lets the 
driver survive while the cluster is temporarily down and restarting. While the 
service provided by the cluster is unavailable, the scheduler should stop 
scheduling new tasks. I added a hook inside the CoarseGrainedSchedulerBackend 
class to avoid scheduling new tasks when necessary.  (was: In our case, we are 
implementing a restartable resource manager mechanism that lets the master 
executor survive while the resource manager service is temporarily down. While 
the resource manager service is unavailable, the scheduler should stop 
scheduling new tasks. I added a hook inside the CoarseGrainedSchedulerBackend 
class to avoid scheduling new tasks when necessary.)
Summary: Add a new hook in CoarseGrainedSchedulerBackend in order to 
stop scheduling new tasks when cluster is restarting  (was: Add a new hook in 
makeOffers in CoarseGrainedSchedulerBackend)

> Add a new hook in CoarseGrainedSchedulerBackend in order to stop scheduling 
> new tasks when cluster is restarting
> 
>
> Key: SPARK-16455
> URL: https://issues.apache.org/jira/browse/SPARK-16455
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Reporter: YangyangLiu
>Priority: Minor
>
> In our case, we are implementing a new mechanism that lets the driver survive 
> while the cluster is temporarily down and restarting. While the service 
> provided by the cluster is unavailable, the scheduler should stop scheduling 
> new tasks. I added a hook inside the CoarseGrainedSchedulerBackend class to 
> avoid scheduling new tasks when necessary.
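
To make the idea concrete, a self-contained toy sketch (these are not Spark's
actual classes or method names, just an illustration of the kind of hook being
proposed):

{code}
// Toy stand-in for a scheduler backend; the hook name below is hypothetical.
trait SchedulerBackendLike {
  /** Subclasses return false while the cluster is restarting or unavailable. */
  protected def isReadyToScheduleTasks: Boolean = true

  /** Only hand out resource offers (and thus launch new tasks) when the hook allows it. */
  def makeOffers(pendingTasks: Seq[String]): Seq[String] =
    if (isReadyToScheduleTasks) pendingTasks else Seq.empty
}

object RestartAwareBackend extends SchedulerBackendLike {
  @volatile var clusterAvailable: Boolean = true
  override protected def isReadyToScheduleTasks: Boolean = clusterAvailable
}
{code}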



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16455) Add a new hook in makeOffers in CoarseGrainedSchedulerBackend

2016-07-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16455:


Assignee: Apache Spark

> Add a new hook in makeOffers in CoarseGrainedSchedulerBackend
> -
>
> Key: SPARK-16455
> URL: https://issues.apache.org/jira/browse/SPARK-16455
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Reporter: YangyangLiu
>Assignee: Apache Spark
>Priority: Minor
>
> In our case, we are implementing a restartable resource manager mechanism that 
> lets the master executor survive while the resource manager service is 
> temporarily down. While the resource manager service is unavailable, the 
> scheduler should stop scheduling new tasks. I added a hook inside the 
> CoarseGrainedSchedulerBackend class to avoid scheduling new tasks when 
> necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16455) Add a new hook in makeOffers in CoarseGrainedSchedulerBackend

2016-07-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16455:


Assignee: (was: Apache Spark)

> Add a new hook in makeOffers in CoarseGrainedSchedulerBackend
> -
>
> Key: SPARK-16455
> URL: https://issues.apache.org/jira/browse/SPARK-16455
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Reporter: YangyangLiu
>Priority: Minor
>
> In our case, we are implementing a restartable resource manager mechanism that 
> lets the master executor survive while the resource manager service is 
> temporarily down. While the resource manager service is unavailable, the 
> scheduler should stop scheduling new tasks. I added a hook inside the 
> CoarseGrainedSchedulerBackend class to avoid scheduling new tasks when 
> necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16455) Add a new hook in makeOffers in CoarseGrainedSchedulerBackend

2016-07-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368744#comment-15368744
 ] 

Apache Spark commented on SPARK-16455:
--

User 'lovexi' has created a pull request for this issue:
https://github.com/apache/spark/pull/14110

> Add a new hook in makeOffers in CoarseGrainedSchedulerBackend
> -
>
> Key: SPARK-16455
> URL: https://issues.apache.org/jira/browse/SPARK-16455
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Reporter: YangyangLiu
>Priority: Minor
>
> In our case, we are implementing a restartable resource manager mechanism that 
> lets the master executor survive while the resource manager service is 
> temporarily down. While the resource manager service is unavailable, the 
> scheduler should stop scheduling new tasks. I added a hook inside the 
> CoarseGrainedSchedulerBackend class to avoid scheduling new tasks when 
> necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16455) Add a new hook in makeOffers in CoarseGrainedSchedulerBackend

2016-07-08 Thread YangyangLiu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YangyangLiu updated SPARK-16455:

Description: In our case, we are implementing a restartable resource manager 
mechanism that lets the master executor survive while the resource manager 
service is temporarily down. While the resource manager service is unavailable, 
the scheduler should stop scheduling new tasks. I added a hook inside the 
CoarseGrainedSchedulerBackend class to avoid scheduling new tasks when 
necessary.

> Add a new hook in makeOffers in CoarseGrainedSchedulerBackend
> -
>
> Key: SPARK-16455
> URL: https://issues.apache.org/jira/browse/SPARK-16455
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Reporter: YangyangLiu
>Priority: Minor
>
> In our case, we are implementing a restartable resource manager mechanism that 
> lets the master executor survive while the resource manager service is 
> temporarily down. While the resource manager service is unavailable, the 
> scheduler should stop scheduling new tasks. I added a hook inside the 
> CoarseGrainedSchedulerBackend class to avoid scheduling new tasks when 
> necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16455) Add a new hook in makeOffers in CoarseGrainedSchedulerBackend

2016-07-08 Thread YangyangLiu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YangyangLiu updated SPARK-16455:

Labels:   (was: newbie)

> Add a new hook in makeOffers in CoarseGrainedSchedulerBackend
> -
>
> Key: SPARK-16455
> URL: https://issues.apache.org/jira/browse/SPARK-16455
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Reporter: YangyangLiu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16424) Add support for Structured Streaming to the ML Pipeline API

2016-07-08 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-16424:

Description: 
For Spark 2.1 we should consider adding support for machine learning on top of 
the structured streaming API.
Early work in progress design outline: 
https://docs.google.com/document/d/1snh7x7b0dQIlTsJNHLr-IxIFgP43RfRV271YK2qGiFQ/edit?usp=sharing

  was:For Spark 2.1 we should consider adding support for machine learning on 
top of the structured streaming API.


> Add support for Structured Streaming to the ML Pipeline API
> ---
>
> Key: SPARK-16424
> URL: https://issues.apache.org/jira/browse/SPARK-16424
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL, Streaming
>Reporter: holdenk
>
> For Spark 2.1 we should consider adding support for machine learning on top 
> of the structured streaming API.
> Early work in progress design outline: 
> https://docs.google.com/document/d/1snh7x7b0dQIlTsJNHLr-IxIFgP43RfRV271YK2qGiFQ/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16454) Consider adding a per-batch transform for structured streaming

2016-07-08 Thread holdenk (JIRA)
holdenk created SPARK-16454:
---

 Summary: Consider adding a per-batch transform for structured 
streaming
 Key: SPARK-16454
 URL: https://issues.apache.org/jira/browse/SPARK-16454
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Streaming
Reporter: holdenk


The new structured streaming API lacks the DStream transform functionality 
(which allowed one to mix in existing RDD transformation logic). It would be 
useful to be able to do per-batch processing as was done in the DStream API, 
even without any specific guarantees about a batch being complete, provided you 
eventually get called with the "catch up" records.

This might be useful for implementing Streaming Machine Learning on Structured 
Streaming.
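
For context, a minimal sketch of the DStream transform hook the description
refers to, assuming an existing StreamingContext and a text stream:

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Arbitrary per-batch RDD logic can be mixed into the stream via transform.
def withPerBatchLogic(lines: DStream[String]): DStream[String] =
  lines.transform { (rdd: RDD[String]) =>
    rdd.filter(_.nonEmpty).distinct()
  }
{code}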



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer

2016-07-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16387.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.0.0

> Reserved SQL words are not escaped by JDBC writer
> -
>
> Key: SPARK-16387
> URL: https://issues.apache.org/jira/browse/SPARK-16387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Lev
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> Here is the code (imports added here for completeness):
> import java.util.Properties
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.{SaveMode, SparkSession}
> object Main extends App {
>   val sqlSession = SparkSession.builder().config(new SparkConf()
>     .setAppName("Sql Test").set("spark.app.id", "SQLTest")
>     .set("spark.master", "local[2]")
>     .set("spark.ui.enabled", "false")
>     .setJars(Seq("/mysql/mysql-connector-java-5.1.38.jar"))
>   ).getOrCreate()
>   import sqlSession.implicits._
>   val localprops = new Properties
>   localprops.put("user", "")
>   localprops.put("password", "")
>   val df = sqlSession.createDataset(Seq("a", "b", "c")).toDF("order")
>   val writer = df.write.mode(SaveMode.Append)
>   writer.jdbc("jdbc:mysql://localhost:3306/test3", "jira_test", localprops)
> }
> The resulting error is:
> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error 
> in your SQL syntax; check the manual that corresponds to your MySQL server 
> version for the right syntax to use near 'order TEXT )' at line 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> Clearly the reserved word {{order}} has to be quoted
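
A small sketch of the kind of quoting being asked for, using Spark's JDBC
dialect API (whether the eventual fix was implemented exactly this way is an
assumption on my part):

{code}
import org.apache.spark.sql.jdbc.JdbcDialects

val dialect = JdbcDialects.get("jdbc:mysql://localhost:3306/test3")
// Quoting is dialect-specific; the identifier comes back wrapped in the
// dialect's quote characters instead of being emitted bare.
val quotedColumn = dialect.quoteIdentifier("order")
val ddl = s"CREATE TABLE jira_test ($quotedColumn TEXT)"
{code}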



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16404) LeastSquaresAggregator in Linear Regression serializes unnecessary data

2016-07-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16404:


Assignee: Apache Spark

> LeastSquaresAggregator in Linear Regression serializes unnecessary data
> ---
>
> Key: SPARK-16404
> URL: https://issues.apache.org/jira/browse/SPARK-16404
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>
> This is basically the same issue as 
> [SPARK-16008|https://issues.apache.org/jira/browse/SPARK-16008], but for 
> linear regression, where {{coefficients}} and {{featuresStd}} are 
> unnecessarily serialized between stages. 
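
A generic sketch of the usual remedy (my assumption about the direction, not the
code from the linked pull request): broadcast large read-only arrays so they are
shipped once per executor rather than captured in every serialized closure:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("BroadcastSketch").getOrCreate()
val sc = spark.sparkContext

// Stand-ins for the model state the aggregator needs on every task.
val coefficients: Array[Double] = Array.fill(1000)(0.1)
val featuresStd: Array[Double]  = Array.fill(1000)(1.0)

val bcCoefficients = sc.broadcast(coefficients)
val bcFeaturesStd  = sc.broadcast(featuresStd)

val partialSum = sc.parallelize(0 until 10000).map { i =>
  // Tasks read the broadcast values instead of a closure-captured copy.
  bcCoefficients.value(i % 1000) / bcFeaturesStd.value(i % 1000)
}.sum()
{code}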



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16404) LeastSquaresAggregator in Linear Regression serializes unnecessary data

2016-07-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16404:


Assignee: (was: Apache Spark)

> LeastSquaresAggregator in Linear Regression serializes unnecessary data
> ---
>
> Key: SPARK-16404
> URL: https://issues.apache.org/jira/browse/SPARK-16404
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>
> This is basically the same issue as 
> [SPARK-16008|https://issues.apache.org/jira/browse/SPARK-16008], but for 
> linear regression, where {{coefficients}} and {{featuresStd}} are 
> unnecessarily serialized between stages. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16404) LeastSquaresAggregator in Linear Regression serializes unnecessary data

2016-07-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368652#comment-15368652
 ] 

Apache Spark commented on SPARK-16404:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/14109

> LeastSquaresAggregator in Linear Regression serializes unnecessary data
> ---
>
> Key: SPARK-16404
> URL: https://issues.apache.org/jira/browse/SPARK-16404
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>
> This is basically the same issue as 
> [SPARK-16008|https://issues.apache.org/jira/browse/SPARK-16008], but for 
> linear regression, where {{coefficients}} and {{featuresStd}} are 
> unnecessarily serialized between stages. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16453) release script does not publish spark-hive-thriftserver_2.10

2016-07-08 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-16453.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14108
[https://github.com/apache/spark/pull/14108]

> release script does not publish spark-hive-thriftserver_2.10
> 
>
> Key: SPARK-16453
> URL: https://issues.apache.org/jira/browse/SPARK-16453
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16365) Ideas for moving "mllib-local" forward

2016-07-08 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368602#comment-15368602
 ] 

Manoj Kumar commented on SPARK-16365:
-

Could you be a bit clearer about the first point? Is it so that people can 
quickly prototype locally with a small subsample of the data before doing the 
DataFrame/RDD conversion to handle huge amounts of data?

> Ideas for moving "mllib-local" forward
> --
>
> Key: SPARK-16365
> URL: https://issues.apache.org/jira/browse/SPARK-16365
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Nick Pentreath
>
> Since SPARK-13944 is all done, we should all think about what the "next 
> steps" might be for {{mllib-local}}. E.g., it could be "improve Spark's 
> linear algebra", or "investigate how we will implement local models/pipelines 
> in Spark", etc.
> This ticket is for comments, ideas, brainstorming, and PoCs. The separation 
> of linalg into a standalone project turned out to be significantly more 
> complex than originally expected. So I vote we devote sufficient discussion 
> and time to planning out the next move :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15767) Decision Tree Regression wrapper in SparkR

2016-07-08 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368594#comment-15368594
 ] 

Shivaram Venkataraman commented on SPARK-15767:
---

Sorry I missed this thread. I agree with [~mengxr] that we should go with the 
`spark.algo` scheme and use the MLlib param names. In the future, if we feel 
like we have significant overlap, we can add an `rpart` wrapper that mimics 
the existing package.

In terms of naming, my vote would be to go with something like 
`spark.decisiontree` or `spark.randomforest` -- my take is that it's slightly 
better to be explicit.

> Decision Tree Regression wrapper in SparkR
> --
>
> Key: SPARK-15767
> URL: https://issues.apache.org/jira/browse/SPARK-15767
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Kai Jiang
>Assignee: Kai Jiang
>
> Implement a wrapper in SparkR to support decision tree regression. R's native 
> Decision Tree Regression implementation comes from the rpart package, with the 
> signature rpart(formula, dataframe, method="anova"). I propose we implement an 
> API like spark.rpart(dataframe, formula, ...). After implementing decision 
> tree classification, we could refactor these two into an API more like rpart().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16452) basic INFORMATION_SCHEMA support

2016-07-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368552#comment-15368552
 ] 

Dongjoon Hyun commented on SPARK-16452:
---

I see. Thanks.

> basic INFORMATION_SCHEMA support
> 
>
> Key: SPARK-16452
> URL: https://issues.apache.org/jira/browse/SPARK-16452
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: INFORMATION_SCHEMAsupport.pdf
>
>
> INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a 
> few tables as defined in SQL92 standard to Spark SQL.
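
A purely hypothetical usage sketch of what such support could enable (the
feature is only proposed here, so the schema and table below are illustrative,
not an existing Spark API):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("InfoSchemaSketch").getOrCreate()

// Hypothetical: query catalog metadata through SQL92-style information_schema views.
spark.sql(
  "SELECT table_name, table_type FROM information_schema.tables WHERE table_schema = 'default'"
).show()
{code}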



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16452) basic INFORMATION_SCHEMA support

2016-07-08 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368550#comment-15368550
 ] 

Reynold Xin commented on SPARK-16452:
-

org.apache.spark.sql.execution.systemcatalog ?

Something like that.


> basic INFORMATION_SCHEMA support
> 
>
> Key: SPARK-16452
> URL: https://issues.apache.org/jira/browse/SPARK-16452
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: INFORMATION_SCHEMAsupport.pdf
>
>
> INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a 
> few tables as defined in SQL92 standard to Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16437) SparkR read.df() from parquet got error: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"

2016-07-08 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368548#comment-15368548
 ] 

Xin Ren commented on SPARK-16437:
-

I think it's a SQL problem; I'll remove the SparkR tag.

> SparkR read.df() from parquet got error: SLF4J: Failed to load class 
> "org.slf4j.impl.StaticLoggerBinder"
> 
>
> Key: SPARK-16437
> URL: https://issues.apache.org/jira/browse/SPARK-16437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Xin Ren
>Priority: Minor
>
> build SparkR with command
> {code}
> build/mvn -DskipTests -Psparkr package
> {code}
> start SparkR console
> {code}
> ./bin/sparkR
> {code}
> then get error
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.0.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> >
> >
> > library(SparkR)
> >
> > df <- read.df("examples/src/main/resources/users.parquet")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> >
> >
> > head(df)
> 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
>     name favorite_color favorite_numbers
> 1 Alyssa                    3, 9, 15, 20
> 2    Ben            red             NULL
> {code}
> Reference
> * it seems we need to add a lib from slf4j pointing to an older version
> http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder
> * on slf4j official site: http://www.slf4j.org/codes.html#StaticLoggerBinder



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16437) SparkR read.df() from parquet got error: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"

2016-07-08 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-16437:

Component/s: (was: SparkR)

> SparkR read.df() from parquet got error: SLF4J: Failed to load class 
> "org.slf4j.impl.StaticLoggerBinder"
> 
>
> Key: SPARK-16437
> URL: https://issues.apache.org/jira/browse/SPARK-16437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Xin Ren
>Priority: Minor
>
> build SparkR with command
> {code}
> build/mvn -DskipTests -Psparkr package
> {code}
> start SparkR console
> {code}
> ./bin/sparkR
> {code}
> then get error
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.0.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> >
> >
> > library(SparkR)
> >
> > df <- read.df("examples/src/main/resources/users.parquet")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> >
> >
> > head(df)
> 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
>     name favorite_color favorite_numbers
> 1 Alyssa                    3, 9, 15, 20
> 2    Ben            red             NULL
> {code}
> Reference
> * it seems we need to add a lib from slf4j pointing to an older version
> http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder
> * on slf4j official site: http://www.slf4j.org/codes.html#StaticLoggerBinder



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16437) SparkR read.df() from parquet got error: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"

2016-07-08 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368546#comment-15368546
 ] 

Shivaram Venkataraman commented on SPARK-16437:
---

FWIW this also happens with Scala (i.e. using bin/spark-shell). I don't think 
there is anything SparkR-specific about this issue.

> SparkR read.df() from parquet got error: SLF4J: Failed to load class 
> "org.slf4j.impl.StaticLoggerBinder"
> 
>
> Key: SPARK-16437
> URL: https://issues.apache.org/jira/browse/SPARK-16437
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Reporter: Xin Ren
>Priority: Minor
>
> build SparkR with command
> {code}
> build/mvn -DskipTests -Psparkr package
> {code}
> start SparkR console
> {code}
> ./bin/sparkR
> {code}
> then get error
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.0.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> >
> >
> > library(SparkR)
> >
> > df <- read.df("examples/src/main/resources/users.parquet")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> >
> >
> > head(df)
> 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
>     name favorite_color favorite_numbers
> 1 Alyssa                    3, 9, 15, 20
> 2    Ben            red             NULL
> {code}
> Reference
> * it seems we need to add a lib from slf4j pointing to an older version
> http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder
> * on slf4j official site: http://www.slf4j.org/codes.html#StaticLoggerBinder



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15767) Decision Tree Regression wrapper in SparkR

2016-07-08 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368547#comment-15368547
 ] 

Xiangrui Meng commented on SPARK-15767:
---

This was discussed in SPARK-14831. We should call it `spark.algo(data, formula, 
method, required params, [optional params])` and use the same param names as in 
MLlib. But I'm not sure what method name to use here. We should think about 
method names for all tree methods together. cc [~josephkb]

> Decision Tree Regression wrapper in SparkR
> --
>
> Key: SPARK-15767
> URL: https://issues.apache.org/jira/browse/SPARK-15767
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Kai Jiang
>Assignee: Kai Jiang
>
> Implement a wrapper in SparkR to support decision tree regression. R's native 
> Decision Tree Regression implementation comes from the rpart package, with the 
> signature rpart(formula, dataframe, method="anova"). I propose we implement an 
> API like spark.rpart(dataframe, formula, ...). After implementing decision 
> tree classification, we could refactor these two into an API more like rpart().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16452) basic INFORMATION_SCHEMA support

2016-07-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368544#comment-15368544
 ] 

Dongjoon Hyun commented on SPARK-16452:
---

What package name do you prefer?

> basic INFORMATION_SCHEMA support
> 
>
> Key: SPARK-16452
> URL: https://issues.apache.org/jira/browse/SPARK-16452
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: INFORMATION_SCHEMAsupport.pdf
>
>
> INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a 
> few tables as defined in SQL92 standard to Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-16452) basic INFORMATION_SCHEMA support

2016-07-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16452:
--
Comment: was deleted

(was: Is this the same or overlapped issue with 
https://issues.apache.org/jira/browse/SPARK-16201 ?)

> basic INFORMATION_SCHEMA support
> 
>
> Key: SPARK-16452
> URL: https://issues.apache.org/jira/browse/SPARK-16452
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: INFORMATION_SCHEMAsupport.pdf
>
>
> INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a 
> few tables as defined in SQL92 standard to Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16281) Implement parse_url SQL function

2016-07-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16281.
-
   Resolution: Fixed
 Assignee: Jian Wu
Fix Version/s: 2.0.1

> Implement parse_url SQL function
> 
>
> Key: SPARK-16281
> URL: https://issues.apache.org/jira/browse/SPARK-16281
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Jian Wu
> Fix For: 2.0.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16429) Include `StringType` columns in `describe()`

2016-07-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16429.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.1.0

> Include `StringType` columns in `describe()`
> 
>
> Key: SPARK-16429
> URL: https://issues.apache.org/jira/browse/SPARK-16429
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently, Spark `describe` supports `StringType`. However, `describe()` 
> returns a dataset covering only the numeric columns.
> This issue aims to include `StringType` columns in `describe()`, i.e. 
> `describe` called without arguments.
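
A minimal sketch of the behaviour change described above (the column names are
made up):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DescribeSketch").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "b")).toDF("num", "str")

df.describe().show()       // with this change, summary rows cover "num" and "str"
df.describe("num").show()  // asking for specific columns is unchanged
{code}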



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16453) release script does not publish spark-hive-thriftserver_2.10

2016-07-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368523#comment-15368523
 ] 

Apache Spark commented on SPARK-16453:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/14108

> release script does not publish spark-hive-thriftserver_2.10
> 
>
> Key: SPARK-16453
> URL: https://issues.apache.org/jira/browse/SPARK-16453
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16453) release script does not publish spark-hive-thriftserver_2.10

2016-07-08 Thread Yin Huai (JIRA)
Yin Huai created SPARK-16453:


 Summary: release script does not publish 
spark-hive-thriftserver_2.10
 Key: SPARK-16453
 URL: https://issues.apache.org/jira/browse/SPARK-16453
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.0.0
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16437) SparkR read.df() from parquet got error: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"

2016-07-08 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-16437:

Description: 
build SparkR with command
{code}
build/mvn -DskipTests -Psparkr package
{code}

start SparkR console
{code}
./bin/sparkR
{code}

then get error
{code}
 Welcome to
  __
   / __/__  ___ _/ /__
  _\ \/ _ \/ _ `/ __/  '_/
 /___/ .__/\_,_/_/ /_/\_\   version  2.0.0-SNAPSHOT
/_/


 SparkSession available as 'spark'.
>
>
> library(SparkR)
>
> df <- read.df("examples/src/main/resources/users.parquet")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
>
>
> head(df)
16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to 
context is not a instance of TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
    name favorite_color favorite_numbers
1 Alyssa                    3, 9, 15, 20
2    Ben            red             NULL
{code}

Reference
* it seems we need to add a lib from slf4j pointing to an older version
http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder
* on slf4j official site: http://www.slf4j.org/codes.html#StaticLoggerBinder

  was:
build SparkR with command
{code}
build/mvn -DskipTests -Psparkr package
{code}

start SparkR console
{code}
./bin/sparkR
{code}

then get error
{code}
 Welcome to
  __
   / __/__  ___ _/ /__
  _\ \/ _ \/ _ `/ __/  '_/
 /___/ .__/\_,_/_/ /_/\_\   version  2.0.0-SNAPSHOT
/_/


 SparkSession available as 'spark'.
>
>
> library(SparkR)
>
> df <- read.df("examples/src/main/resources/users.parquet")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
>
>
> head(df)
16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to 
context is not a instance of TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
    name favorite_color favorite_numbers
1 Alyssa                    3, 9, 15, 20
2    Ben            red             NULL
{code}

it seems we need to add a lib from slf4j
http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder


> SparkR read.df() from parquet got error: SLF4J: Failed to load class 
> "org.slf4j.impl.StaticLoggerBinder"
> 
>
> Key: SPARK-16437
> URL: https://issues.apache.org/jira/browse/SPARK-16437
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Reporter: Xin Ren
>Priority: Minor
>
> build SparkR with command
> {code}
> build/mvn -DskipTests -Psparkr package
> {code}
> start SparkR console
> {code}
> ./bin/sparkR
> {code}
> then get error
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.0.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> >
> >
> > library(SparkR)
> >
> > df <- read.df("examples/src/main/resources/users.parquet")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> >
> >
> > head(df)
> 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
>     name favorite_color favorite_numbers
> 1 Alyssa                    3, 9, 15, 20
> 2    Ben            red             NULL
> {code}
> Reference
> * it seems we need to add a lib from slf4j pointing to an older version
> http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder
> * on slf4j official site: http://www.slf4j.org/codes.html#StaticLoggerBinder



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16201) Expose information schema

2016-07-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-16201.
---
Resolution: Duplicate

> Expose information schema
> -
>
> Key: SPARK-16201
> URL: https://issues.apache.org/jira/browse/SPARK-16201
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>
> As a Spark user I want to be able to query metadata, using information_schema 
> views. This is an umbrella ticket for adding this to Spark 2.1.
> This should support:
> - Databases
> - Tables
> - Views
> - Columns
> - Partitions
> - Functions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15630) 2.0 python coverage ml root module

2016-07-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15630:
--
Target Version/s: 2.0.0
Priority: Blocker  (was: Major)

> 2.0 python coverage ml root module
> --
>
> Key: SPARK-15630
> URL: https://issues.apache.org/jira/browse/SPARK-15630
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Blocker
>
> Audit the root pipeline components in PySpark ML for API compatibility. See 
> parent  SPARK-14813



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15623) 2.0 python coverage ml.feature

2016-07-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15623:
--
Target Version/s: 2.0.0
Priority: Blocker  (was: Major)

> 2.0 python coverage ml.feature
> --
>
> Key: SPARK-15623
> URL: https://issues.apache.org/jira/browse/SPARK-15623
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Blocker
>
> See parent task SPARK-14813.
> [~bryanc] did this component.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14813) ML 2.0 QA: API: Python API coverage

2016-07-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14813:
--
Target Version/s: 2.0.0
Priority: Blocker  (was: Major)

> ML 2.0 QA: API: Python API coverage
> ---
>
> Key: SPARK-14813
> URL: https://issues.apache.org/jira/browse/SPARK-14813
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: holdenk
>Priority: Blocker
>
> For new public APIs added to MLlib, we need to check the generated HTML doc 
> and compare the Scala & Python versions.  We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> Please use a *separate* JIRA (linked below as "requires") for this list of 
> to-do items.
> ** *NOTE: These missing features should be added in the next release.  This 
> work is just to generate a list of to-do items for the future.*
> UPDATE: This only needs to cover spark.ml since spark.mllib is going into 
> maintenance mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16437) SparkR read.df() from parquet got error: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"

2016-07-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16437:
--
Fix Version/s: (was: 2.0.0)

> SparkR read.df() from parquet got error: SLF4J: Failed to load class 
> "org.slf4j.impl.StaticLoggerBinder"
> 
>
> Key: SPARK-16437
> URL: https://issues.apache.org/jira/browse/SPARK-16437
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Reporter: Xin Ren
>Priority: Minor
>
> build SparkR with command
> {code}
> build/mvn -DskipTests -Psparkr package
> {code}
> start SparkR console
> {code}
> ./bin/sparkR
> {code}
> then get error
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.0.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> >
> >
> > library(SparkR)
> >
> > df <- read.df("examples/src/main/resources/users.parquet")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> >
> >
> > head(df)
> 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
>     name favorite_color favorite_numbers
> 1 Alyssa                    3, 9, 15, 20
> 2    Ben            red            NULL
> {code}
> It seems we need to add an SLF4J binding library; see
> http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16452) basic INFORMATION_SCHEMA support

2016-07-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368504#comment-15368504
 ] 

Dongjoon Hyun commented on SPARK-16452:
---

Is this the same or overlapped issue with 
https://issues.apache.org/jira/browse/SPARK-16201 ?

> basic INFORMATION_SCHEMA support
> 
>
> Key: SPARK-16452
> URL: https://issues.apache.org/jira/browse/SPARK-16452
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: INFORMATION_SCHEMAsupport.pdf
>
>
> INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a 
> few tables as defined in SQL92 standard to Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16437) SparkR read.df() from parquet got error: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"

2016-07-08 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-16437:

Description: 
build SparkR with command
{code}
build/mvn -DskipTests -Psparkr package
{code}

start SparkR console
{code}
./bin/sparkR
{code}

then get error
{code}
 Welcome to
  __
   / __/__  ___ _/ /__
  _\ \/ _ \/ _ `/ __/  '_/
 /___/ .__/\_,_/_/ /_/\_\   version  2.0.0-SNAPSHOT
/_/


 SparkSession available as 'spark'.
>
>
> library(SparkR)
>
> df <- read.df("examples/src/main/resources/users.parquet")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
>
>
> head(df)
16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to 
context is not a instance of TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
    name favorite_color favorite_numbers
1 Alyssa                    3, 9, 15, 20
2    Ben            red            NULL
{code}

It seems we need to add an SLF4J binding library; see
http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder

  was:
start SparkR console
{code}
./bin/sparkR
{code}

then get error
{code}
 Welcome to
  __
   / __/__  ___ _/ /__
  _\ \/ _ \/ _ `/ __/  '_/
 /___/ .__/\_,_/_/ /_/\_\   version  2.0.0-SNAPSHOT
/_/


 SparkSession available as 'spark'.
>
>
> library(SparkR)
>
> df <- read.df("examples/src/main/resources/users.parquet")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
>
>
> head(df)
16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to 
context is not a instance of TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
    name favorite_color favorite_numbers
1 Alyssa                    3, 9, 15, 20
2    Ben            red            NULL
{code}

It seems we need to add an SLF4J binding library; see
http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder


> SparkR read.df() from parquet got error: SLF4J: Failed to load class 
> "org.slf4j.impl.StaticLoggerBinder"
> 
>
> Key: SPARK-16437
> URL: https://issues.apache.org/jira/browse/SPARK-16437
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Reporter: Xin Ren
>Priority: Minor
> Fix For: 2.0.0
>
>
> build SparkR with command
> {code}
> build/mvn -DskipTests -Psparkr package
> {code}
> start SparkR console
> {code}
> ./bin/sparkR
> {code}
> then get error
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.0.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> >
> >
> > library(SparkR)
> >
> > df <- read.df("examples/src/main/resources/users.parquet")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> >
> >
> > head(df)
> 16/07/07 23:20:54 WARN ParquetRecordReader: Can not initialize counter due to 
> context is not a instance of TaskInputOutputContext, but is 
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
>     name favorite_color favorite_numbers
> 1 Alyssa                    3, 9, 15, 20
> 2    Ben            red            NULL
> {code}
> It seems we need to add an SLF4J binding library; see
> http://stackoverflow.com/questions/7421612/slf4j-failed-to-load-class-org-slf4j-impl-staticloggerbinder



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16452) basic INFORMATION_SCHEMA support

2016-07-08 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368497#comment-15368497
 ] 

Reynold Xin commented on SPARK-16452:
-

Thanks - feel free to break this down into multiple pieces.


> basic INFORMATION_SCHEMA support
> 
>
> Key: SPARK-16452
> URL: https://issues.apache.org/jira/browse/SPARK-16452
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: INFORMATION_SCHEMAsupport.pdf
>
>
> INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a 
> few tables as defined in SQL92 standard to Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16452) basic INFORMATION_SCHEMA support

2016-07-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368494#comment-15368494
 ] 

Dongjoon Hyun commented on SPARK-16452:
---

Sure!

> basic INFORMATION_SCHEMA support
> 
>
> Key: SPARK-16452
> URL: https://issues.apache.org/jira/browse/SPARK-16452
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: INFORMATION_SCHEMAsupport.pdf
>
>
> INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a 
> few tables as defined in SQL92 standard to Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16452) basic INFORMATION_SCHEMA support

2016-07-08 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368486#comment-15368486
 ] 

Reynold Xin edited comment on SPARK-16452 at 7/8/16 9:12 PM:
-

cc [~dongjoon] would you be interested in working on this?



was (Author: rxin):
design spec

> basic INFORMATION_SCHEMA support
> 
>
> Key: SPARK-16452
> URL: https://issues.apache.org/jira/browse/SPARK-16452
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: INFORMATION_SCHEMAsupport.pdf
>
>
> INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a 
> few tables as defined in SQL92 standard to Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16452) basic INFORMATION_SCHEMA support

2016-07-08 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-16452:
---

 Summary: basic INFORMATION_SCHEMA support
 Key: SPARK-16452
 URL: https://issues.apache.org/jira/browse/SPARK-16452
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin
 Attachments: INFORMATION_SCHEMAsupport.pdf

INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a few 
tables as defined in SQL92 standard to Spark SQL.
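
To give a sense of the intended user experience, a sketch of the kind of query this would 
enable (the view and column names are assumptions taken from the SQL92 INFORMATION_SCHEMA 
definition, not a committed Spark API):

{code}
// Hypothetical usage once INFORMATION_SCHEMA support lands; names are illustrative only.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("information-schema-sketch").getOrCreate()

spark.sql("SELECT table_schema, table_name FROM information_schema.tables").show()
spark.sql(
  "SELECT column_name, data_type FROM information_schema.columns " +
  "WHERE table_name = 'my_table'").show()
{code}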





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16452) basic INFORMATION_SCHEMA support

2016-07-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16452:

Attachment: INFORMATION_SCHEMAsupport.pdf

design spec

> basic INFORMATION_SCHEMA support
> 
>
> Key: SPARK-16452
> URL: https://issues.apache.org/jira/browse/SPARK-16452
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: INFORMATION_SCHEMAsupport.pdf
>
>
> INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a 
> few tables as defined in SQL92 standard to Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer

2016-07-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368482#comment-15368482
 ] 

Apache Spark commented on SPARK-16387:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/14107

> Reserved SQL words are not escaped by JDBC writer
> -
>
> Key: SPARK-16387
> URL: https://issues.apache.org/jira/browse/SPARK-16387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Lev
>
> Here is a code (imports are omitted)
> object Main extends App {
>   val sqlSession = SparkSession.builder().config(new SparkConf().
> setAppName("Sql Test").set("spark.app.id", "SQLTest").
> set("spark.master", "local[2]").
> set("spark.ui.enabled", "false")
> .setJars(Seq("/mysql/mysql-connector-java-5.1.38.jar" ))
>   ).getOrCreate()
>   import sqlSession.implicits._
>   val localprops = new Properties
>   localprops.put("user", "")
>   localprops.put("password", "")
>   val df = sqlSession.createDataset(Seq("a","b","c")).toDF("order")
>   val writer = df.write
>   .mode(SaveMode.Append)
>   writer
>   .jdbc("jdbc:mysql://localhost:3306/test3", s"jira_test", localprops)
> }
> End error is :
> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error 
> in your SQL syntax; check the manual that corresponds to your MySQL server 
> version for the right syntax to use near 'order TEXT )' at line 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> Clearly the reserved word "order" has to be quoted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer

2016-07-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16387:


Assignee: (was: Apache Spark)

> Reserved SQL words are not escaped by JDBC writer
> -
>
> Key: SPARK-16387
> URL: https://issues.apache.org/jira/browse/SPARK-16387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Lev
>
> Here is a code (imports are omitted)
> object Main extends App {
>   val sqlSession = SparkSession.builder().config(new SparkConf().
> setAppName("Sql Test").set("spark.app.id", "SQLTest").
> set("spark.master", "local[2]").
> set("spark.ui.enabled", "false")
> .setJars(Seq("/mysql/mysql-connector-java-5.1.38.jar" ))
>   ).getOrCreate()
>   import sqlSession.implicits._
>   val localprops = new Properties
>   localprops.put("user", "")
>   localprops.put("password", "")
>   val df = sqlSession.createDataset(Seq("a","b","c")).toDF("order")
>   val writer = df.write
>   .mode(SaveMode.Append)
>   writer
>   .jdbc("jdbc:mysql://localhost:3306/test3", s"jira_test", localprops)
> }
> End error is :
> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error 
> in your SQL syntax; check the manual that corresponds to your MySQL server 
> version for the right syntax to use near 'order TEXT )' at line 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> Clearly the reserved word "order" has to be quoted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16445) Multilayer Perceptron Classifier wrapper in SparkR

2016-07-08 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368479#comment-15368479
 ] 

Xin Ren commented on SPARK-16445:
-

Hi Xiangrui, may I have a try on this one? Is there a strict deadline to hit? 
Thanks a lot

> Multilayer Perceptron Classifier wrapper in SparkR
> --
>
> Key: SPARK-16445
> URL: https://issues.apache.org/jira/browse/SPARK-16445
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, SparkR
>Reporter: Xiangrui Meng
>
> Follow instructions in SPARK-16442 and implement multilayer perceptron 
> classifier wrapper in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10292) make metadata query-able by adding meta table

2016-07-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-10292.
---
Resolution: Later

> make metadata query-able by adding meta table
> -
>
> Key: SPARK-10292
> URL: https://issues.apache.org/jira/browse/SPARK-10292
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
>
> To begin with, we can set up 2 meta tables: catalog_tables and 
> catalog_columns.
> catalog_tables have 3 columns: name, database, is_temporary.
> catalog_columns have 5 columns: name, table, database, data_type, nullable.
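
As an illustration only, the proposed meta tables could be described with the following 
schemas (a sketch; the column types are assumptions, since the ticket does not specify them):

{code}
// Sketch of the proposed catalog_tables / catalog_columns schemas as Spark SQL StructTypes.
import org.apache.spark.sql.types._

val catalogTablesSchema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("database", StringType, nullable = true),
  StructField("is_temporary", BooleanType, nullable = false)))

val catalogColumnsSchema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("table", StringType, nullable = false),
  StructField("database", StringType, nullable = true),
  StructField("data_type", StringType, nullable = false),
  StructField("nullable", BooleanType, nullable = false)))
{code}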



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16451) Spark-shell / pyspark should finish gracefully when "SaslException: GSS initiate failed" is hit

2016-07-08 Thread Yesha Vora (JIRA)
Yesha Vora created SPARK-16451:
--

 Summary: Spark-shell / pyspark should finish gracefully when 
"SaslException: GSS initiate failed" is hit
 Key: SPARK-16451
 URL: https://issues.apache.org/jira/browse/SPARK-16451
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.1
Reporter: Yesha Vora


Steps to reproduce: (secure cluster)
* kdestroy
* spark-shell --master yarn-client

If no valid keytab is set while running spark-shell/pyspark, the Spark client never 
exits; it keeps printing the error messages shown in the log further below.
The Spark client should call its shutdown hook immediately and exit with a proper 
error code. Currently, the user needs to shut the process down explicitly (using 
Ctrl+C).
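
A minimal sketch of the desired fail-fast behaviour (how the failure is detected and the 
exit code used are assumptions for illustration, not the actual Spark client logic); the 
error log that is currently printed in a loop follows the sketch.

{code}
// Hypothetical sketch: abort instead of retrying forever when Kerberos credentials are missing.
import javax.security.sasl.SaslException

def startBackendOrExit(start: () => Unit): Unit = {
  try {
    start()
  } catch {
    case e: SaslException =>
      // Log once, let shutdown hooks run, and exit with a non-zero code instead of looping.
      System.err.println(s"GSS initiate failed, no valid Kerberos TGT: ${e.getMessage}")
      sys.exit(1)
  }
}
{code}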

{code}
16/07/08 20:53:10 WARN Client: Exception encountered while connecting to the 
server : 
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)]
at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at 
org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413)
at 
org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:595)
at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:397)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:761)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:757)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:756)
at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1617)
at org.apache.hadoop.ipc.Client.call(Client.java:1448)
at org.apache.hadoop.ipc.Client.call(Client.java:1395)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at com.sun.proxy.$Proxy25.getFileInfo(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy26.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2151)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1408)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1404)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1437)
at 
org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.<init>(FileSystemTimelineWriter.java:124)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.createTimelineWriter(TimelineClientImpl.java:316)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStart(TimelineClientImpl.java:308)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:194)
at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:127)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
at 
org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
at $line3.$read$$iwC$$iwC.<init>(<console>:15)
at $line3.$read$$iwC.<init>(<console>:24)
at $line3.$read.<init>(<console>:26)
at $line3.$read$.<init>(<console>:30)
at $line3.$read$.<clinit>()
at $line3.$eval$.<init>(<console>:7)
at $line3.$eval$.<clinit>()
at $line3.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 

[jira] [Commented] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer

2016-07-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368458#comment-15368458
 ] 

Dongjoon Hyun commented on SPARK-16387:
---

You're right. I'll make a PR for this.

> Reserved SQL words are not escaped by JDBC writer
> -
>
> Key: SPARK-16387
> URL: https://issues.apache.org/jira/browse/SPARK-16387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Lev
>
> Here is a code (imports are omitted)
> object Main extends App {
>   val sqlSession = SparkSession.builder().config(new SparkConf().
> setAppName("Sql Test").set("spark.app.id", "SQLTest").
> set("spark.master", "local[2]").
> set("spark.ui.enabled", "false")
> .setJars(Seq("/mysql/mysql-connector-java-5.1.38.jar" ))
>   ).getOrCreate()
>   import sqlSession.implicits._
>   val localprops = new Properties
>   localprops.put("user", "")
>   localprops.put("password", "")
>   val df = sqlSession.createDataset(Seq("a","b","c")).toDF("order")
>   val writer = df.write
>   .mode(SaveMode.Append)
>   writer
>   .jdbc("jdbc:mysql://localhost:3306/test3", s"jira_test", localprops)
> }
> End error is :
> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error 
> in your SQL syntax; check the manual that corresponds to your MySQL server 
> version for the right syntax to use near 'order TEXT )' at line 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> Clearly the reserved word "order" has to be quoted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer

2016-07-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16387:
--
Comment: was deleted

(was: Oh, it means Pull Request.
Since you know `JdbcDialect` class, I think you can make a code patch for that.)

> Reserved SQL words are not escaped by JDBC writer
> -
>
> Key: SPARK-16387
> URL: https://issues.apache.org/jira/browse/SPARK-16387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Lev
>
> Here is a code (imports are omitted)
> object Main extends App {
>   val sqlSession = SparkSession.builder().config(new SparkConf().
> setAppName("Sql Test").set("spark.app.id", "SQLTest").
> set("spark.master", "local[2]").
> set("spark.ui.enabled", "false")
> .setJars(Seq("/mysql/mysql-connector-java-5.1.38.jar" ))
>   ).getOrCreate()
>   import sqlSession.implicits._
>   val localprops = new Properties
>   localprops.put("user", "")
>   localprops.put("password", "")
>   val df = sqlSession.createDataset(Seq("a","b","c")).toDF("order")
>   val writer = df.write
>   .mode(SaveMode.Append)
>   writer
>   .jdbc("jdbc:mysql://localhost:3306/test3", s"jira_test", localprops)
> }
> End error is :
> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error 
> in your SQL syntax; check the manual that corresponds to your MySQL server 
> version for the right syntax to use near 'order TEXT )' at line 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> Clearly the reserved word "order" has to be quoted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer

2016-07-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16387:
--
Comment: was deleted

(was: Hi,
`escaping` sounds possible, but it is not an easy issue to implement portably for 
every DB. We need to support various databases: MySQL, PostgreSQL, MSSQL, and so on.
The standard quoting character is the double quote ("), but even MySQL does not 
support it by default (only in ANSI mode?). MySQL uses the backtick (`), while 
PostgreSQL does not (if I remember correctly), and MSSQL uses '[]'.

I want to help you with this issue, but I've no idea. Do you have any idea for 
this? )
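
For reference, a minimal sketch of per-dialect identifier quoting (the trait and object 
names are illustrative, not Spark's JdbcDialect API):

{code}
// Illustrative sketch only: each dialect supplies its own quoting rule for identifiers.
trait IdentifierQuoting {
  def quoteIdentifier(name: String): String
}

object MySQLQuoting extends IdentifierQuoting {
  override def quoteIdentifier(name: String): String = s"`$name`"
}

object PostgresQuoting extends IdentifierQuoting {
  // Double quotes are the SQL-standard form; embedded quotes are doubled.
  override def quoteIdentifier(name: String): String = "\"" + name.replace("\"", "\"\"") + "\""
}

// e.g. building a CREATE TABLE column list with every column name quoted:
def columnDdl(cols: Seq[(String, String)], q: IdentifierQuoting): String =
  cols.map { case (name, sqlType) => s"${q.quoteIdentifier(name)} $sqlType" }.mkString(", ")
{code}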

> Reserved SQL words are not escaped by JDBC writer
> -
>
> Key: SPARK-16387
> URL: https://issues.apache.org/jira/browse/SPARK-16387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Lev
>
> Here is a code (imports are omitted)
> object Main extends App {
>   val sqlSession = SparkSession.builder().config(new SparkConf().
> setAppName("Sql Test").set("spark.app.id", "SQLTest").
> set("spark.master", "local[2]").
> set("spark.ui.enabled", "false")
> .setJars(Seq("/mysql/mysql-connector-java-5.1.38.jar" ))
>   ).getOrCreate()
>   import sqlSession.implicits._
>   val localprops = new Properties
>   localprops.put("user", "")
>   localprops.put("password", "")
>   val df = sqlSession.createDataset(Seq("a","b","c")).toDF("order")
>   val writer = df.write
>   .mode(SaveMode.Append)
>   writer
>   .jdbc("jdbc:mysql://localhost:3306/test3", s"jira_test", localprops)
> }
> End error is :
> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error 
> in your SQL syntax; check the manual that corresponds to your MySQL server 
> version for the right syntax to use near 'order TEXT )' at line 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> Clearly the reserved word "order" has to be quoted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-16387) Reserved SQL words are not escaped by JDBC writer

2016-07-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16387:
--
Comment: was deleted

(was: Then, could you make a PR for this?)

> Reserved SQL words are not escaped by JDBC writer
> -
>
> Key: SPARK-16387
> URL: https://issues.apache.org/jira/browse/SPARK-16387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Lev
>
> Here is a code (imports are omitted)
> object Main extends App {
>   val sqlSession = SparkSession.builder().config(new SparkConf().
> setAppName("Sql Test").set("spark.app.id", "SQLTest").
> set("spark.master", "local[2]").
> set("spark.ui.enabled", "false")
> .setJars(Seq("/mysql/mysql-connector-java-5.1.38.jar" ))
>   ).getOrCreate()
>   import sqlSession.implicits._
>   val localprops = new Properties
>   localprops.put("user", "")
>   localprops.put("password", "")
>   val df = sqlSession.createDataset(Seq("a","b","c")).toDF("order")
>   val writer = df.write
>   .mode(SaveMode.Append)
>   writer
>   .jdbc("jdbc:mysql://localhost:3306/test3", s"jira_test", localprops)
> }
> End error is :
> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error 
> in your SQL syntax; check the manual that corresponds to your MySQL server 
> version for the right syntax to use near 'order TEXT )' at line 1
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> Clearly the reserved word "order" has to be quoted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13346) Using DataFrames iteratively leads to slow query planning

2016-07-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-13346:
--
Summary: Using DataFrames iteratively leads to slow query planning  (was: 
Using DataFrames iteratively leads to massive query plans, which slows 
execution)

> Using DataFrames iteratively leads to slow query planning
> -
>
> Key: SPARK-13346
> URL: https://issues.apache.org/jira/browse/SPARK-13346
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>
> I have an iterative algorithm based on DataFrames, and the query plan grows 
> very quickly with each iteration.  Caching the current DataFrame at the end 
> of an iteration does not fix the problem.  However, converting the DataFrame 
> to an RDD and back at the end of each iteration does fix the problem.
> Printing the query plans shows that the plan explodes quickly (10 lines, to 
> several hundred lines, to several thousand lines, ...) with successive 
> iterations.
> The desired behavior is for the analyzer to recognize that a big chunk of the 
> query plan does not need to be computed since it is already cached.  The 
> computation on each iteration should be the same.
> If useful, I can push (complex) code to reproduce the issue.  But it should 
> be simple to see if you create an iterative algorithm which produces a new 
> DataFrame from an old one on each iteration.
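
A sketch of the RDD round-trip workaround mentioned above (the per-iteration transformation 
and the column name are placeholders, not the reporter's actual algorithm):

{code}
// Illustrative only: round-tripping through an RDD truncates the logical plan each iteration.
import org.apache.spark.sql.{DataFrame, SparkSession}

def iterate(spark: SparkSession, initial: DataFrame, numIters: Int): DataFrame = {
  var df = initial
  for (_ <- 0 until numIters) {
    val next = df.withColumn("value", df("value") * 2)  // stand-in for the real update step
    // Rebuild the DataFrame from the RDD so the next iteration starts from a short plan.
    df = spark.createDataFrame(next.rdd, next.schema)
  }
  df
}
{code}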



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-07-08 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368320#comment-15368320
 ] 

holdenk commented on SPARK-15581:
-

Yah - the more I look at it the more rough it seems - [~tdas] has it in his 
slides as targeted for Spark 2.1+ 
(http://www.slideshare.net/databricks/a-deep-dive-into-structured-streaming)

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delay in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there exist no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in DataFrame-based API
> * Umbrella JIRA: [SPARK-4591]
> h2. Persistence
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose backwards compatibility for persistence (SPARK-15573)
> h2. Python API
> * Standardize unit tests for Scala and Python to improve and consolidate test 
> coverage for Params, persistence, and other common functionality (SPARK-15571)
> * Improve Python API handling of Params, persistence (SPARK-14771) 
> (SPARK-14706)
> ** Note: The linked JIRAs for this are incomplete.  More to be created...
> ** Related: Implement Python meta-algorithms in Scala (to simplify 
> persistence) (SPARK-15574)
> * Feature parity: The main goal of the Python API 

[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-07-08 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368318#comment-15368318
 ] 

Nick Pentreath commented on SPARK-15581:


I think it would be pretty interesting to explore a (probably fairly experimental) 
mechanism to train on structured streams/DFs and sink to a "prediction" stream 
and/or some "model state store".

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delay in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there exist no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in DataFrame-based API
> * Umbrella JIRA: [SPARK-4591]
> h2. Persistence
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose backwards compatibility for persistence (SPARK-15573)
> h2. Python API
> * Standardize unit tests for Scala and Python to improve and consolidate test 
> coverage for Params, persistence, and other common functionality (SPARK-15571)
> * Improve Python API handling of Params, persistence (SPARK-14771) 
> (SPARK-14706)
> ** Note: The linked JIRAs for this are incomplete.  More to be created...
> ** Related: Implement Python meta-algorithms in Scala (to simplify 
> persistence) (SPARK-15574)
> * Feature parity: The main 

[jira] [Commented] (SPARK-16365) Ideas for moving "mllib-local" forward

2016-07-08 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368310#comment-15368310
 ] 

Nick Pentreath commented on SPARK-16365:


Good question - and part of the reason for getting discussion going here. In 
general (IMO) the short answer is "no" - I think Spark should be the tool for 
training models on moderately large to extremely large datasets, but not 
necessarily for completely general machine learning.

I think the idea behind {{mllib-local}} is potentially two-fold: (i) make it 
easier to use Spark models / pipelines in production scenarios, and (ii) 
enhance linalg primitives available to devs / users.
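
As a toy sketch of point (i), the mllib-local types can already be used without any 
SparkContext (the hand-rolled dot product below is an assumption for illustration, since 
Spark's internal BLAS helpers are not public):

{code}
// Local scoring sketch using only the mllib-local linear algebra types.
import org.apache.spark.ml.linalg.{Vector, Vectors}

val weights: Vector  = Vectors.dense(0.5, -1.0, 2.0)
val features: Vector = Vectors.dense(1.0, 3.0, 0.25)

def dot(a: Vector, b: Vector): Double =
  a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum

val margin = dot(weights, features)  // a tiny "local model" prediction step
println(s"margin = $margin")
{code}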

> Ideas for moving "mllib-local" forward
> --
>
> Key: SPARK-16365
> URL: https://issues.apache.org/jira/browse/SPARK-16365
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Nick Pentreath
>
> Since SPARK-13944 is all done, we should all think about what the "next 
> steps" might be for {{mllib-local}}. E.g., it could be "improve Spark's 
> linear algebra", or "investigate how we will implement local models/pipelines 
> in Spark", etc.
> This ticket is for comments, ideas, brainstormings and PoCs. The separation 
> of linalg into a standalone project turned out to be significantly more 
> complex than originally expected. So I vote we devote sufficient discussion 
> and time to planning out the next move :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16420) UnsafeShuffleWriter leaks compression streams with off-heap memory.

2016-07-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16420.
-
   Resolution: Fixed
 Assignee: Ryan Blue
Fix Version/s: 2.0.0

> UnsafeShuffleWriter leaks compression streams with off-heap memory.
> ---
>
> Key: SPARK-16420
> URL: https://issues.apache.org/jira/browse/SPARK-16420
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 2.0.0
>
>
> When merging spill files using Java file streams, {{UnsafeShuffleWriter}} 
> opens a decompression stream for each shuffle part and [never closes 
> them|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriter.java#L352-357].
>  When the compression codec holds off-heap resources, these aren't cleaned up 
> until the finalizer is called.
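
A general sketch of the fix pattern (illustrative Scala, not the actual UnsafeShuffleWriter 
Java code): close each decompression stream deterministically instead of relying on 
finalizers to release the codec's off-heap buffers.

{code}
// Illustrative pattern only; how the compressed stream is opened is a placeholder.
import java.io.{InputStream, OutputStream}

def copyPartition(openCompressed: () => InputStream, sink: OutputStream): Unit = {
  val in = openCompressed()
  try {
    val buf = new Array[Byte](64 * 1024)
    var n = in.read(buf)
    while (n != -1) {
      sink.write(buf, 0, n)
      n = in.read(buf)
    }
  } finally {
    in.close()  // release any off-heap resources held by the decompression codec right away
  }
}
{code}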



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-16450) Build fails for Mesos 0.28.x

2016-07-08 Thread Niels Becker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niels Becker updated SPARK-16450:
-
Comment: was deleted

(was: Spark 1.6.1 was working with Mesos 0.28.0)

> Build fails for Mesos 0.28.x
> -
>
> Key: SPARK-16450
> URL: https://issues.apache.org/jira/browse/SPARK-16450
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.0.0
> Environment: Mesos 0.28.0
>Reporter: Niels Becker
>
> Build fails:
> [error] 
> /usr/local/spark/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:82:
>  type mismatch;
> [error]  found   : org.apache.mesos.protobuf.ByteString
> [error]  required: String
> [error]   credBuilder.setSecret(ByteString.copyFromUtf8(secret))
> Build cmd:
> dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive 
> -DskipTests -Dmesos.version=0.28.0 -Djava.version=1.8
> Spark Version: 2.0.0-rc2
> Java: OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-1~bpo8+1-b14
> Scala Version: 2.11.8
> Same error for mesos.version=0.28.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16449) unionAll raises "Task not serializable"

2016-07-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368249#comment-15368249
 ] 

Dongjoon Hyun commented on SPARK-16449:
---

SPARK-16173 resolved that kind of issue.

> unionAll raises "Task not serializable"
> ---
>
> Key: SPARK-16449
> URL: https://issues.apache.org/jira/browse/SPARK-16449
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
> Environment: AWS EMR, Jupyter notebook
>Reporter: Jeff Levy
>Priority: Minor
>
> Goal: Take the output from `describe` on a large DataFrame, then use a loop 
> to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each 
> column, build them into a DataFrame of two rows, then use `unionAll` to merge 
> them together.
> Issue: Despite having the same column names, in the same order with the same 
> dtypes, the `unionAll` fails with "Task not serializable".  However, if I 
> build two test rows using dummy data then `unionAll` works fine.  Also, if I 
> collect my results then turn them straight back into DataFrames, `unionAll` 
> succeeds.  
> Step-by-step code and output with comments can be seen here: 
> https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb
> The issue appears to be in the way the loop in code block 6 is building the 
> rows before parallelizing, but the results look no different from the test 
> rows that do work.  I reproduced this on multiple datasets, so downloading 
> the notebook and pointing it to any data of your own should replicate it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-16449) unionAll raises "Task not serializable"

2016-07-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16449:
--
Comment: was deleted

(was: Oh, I see.)

> unionAll raises "Task not serializable"
> ---
>
> Key: SPARK-16449
> URL: https://issues.apache.org/jira/browse/SPARK-16449
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
> Environment: AWS EMR, Jupyter notebook
>Reporter: Jeff Levy
>Priority: Minor
>
> Goal: Take the output from `describe` on a large DataFrame, then use a loop 
> to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each 
> column, build them into a DataFrame of two rows, then use `unionAll` to merge 
> them together.
> Issue: Despite having the same column names, in the same order with the same 
> dtypes, the `unionAll` fails with "Task not serializable".  However, if I 
> build two test rows using dummy data then `unionAll` works fine.  Also, if I 
> collect my results then turn them straight back into DataFrames, `unionAll` 
> succeeds.  
> Step-by-step code and output with comments can be seen here: 
> https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb
> The issue appears to be in the way the loop in code block 6 is building the 
> rows before parallelizing, but the results look no different from the test 
> rows that do work.  I reproduced this on multiple datasets, so downloading 
> the notebook and pointing it to any data of your own should replicate it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16449) unionAll raises "Task not serializable"

2016-07-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368245#comment-15368245
 ] 

Dongjoon Hyun commented on SPARK-16449:
---

It seems you are passing the result of `describe` directly to `unionAll`, i.e. 
`df_described`:
```
df_described = df.describe()
expanded_describe = df_described.unionAll(df_described2)
expanded_describe.show()
```

> unionAll raises "Task not serializable"
> ---
>
> Key: SPARK-16449
> URL: https://issues.apache.org/jira/browse/SPARK-16449
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
> Environment: AWS EMR, Jupyter notebook
>Reporter: Jeff Levy
>Priority: Minor
>
> Goal: Take the output from `describe` on a large DataFrame, then use a loop 
> to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each 
> column, build them into a DataFrame of two rows, then use `unionAll` to merge 
> them together.
> Issue: Despite having the same column names, in the same order with the same 
> dtypes, the `unionAll` fails with "Task not serializable".  However, if I 
> build two test rows using dummy data then `unionAll` works fine.  Also, if I 
> collect my results then turn them straight back into DataFrames, `unionAll` 
> succeeds.  
> Step-by-step code and output with comments can be seen here: 
> https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb
> The issue appears to be in the way the loop in code block 6 is building the 
> rows before parallelizing, but the results look no different from the test 
> rows that do work.  I reproduced this on multiple datasets, so downloading 
> the notebook and pointing it to any data of your own should replicate it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16449) unionAll raises "Task not serializable"

2016-07-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368232#comment-15368232
 ] 

Dongjoon Hyun commented on SPARK-16449:
---

Oh, I see.

> unionAll raises "Task not serializable"
> ---
>
> Key: SPARK-16449
> URL: https://issues.apache.org/jira/browse/SPARK-16449
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
> Environment: AWS EMR, Jupyter notebook
>Reporter: Jeff Levy
>Priority: Minor
>
> Goal: Take the output from `describe` on a large DataFrame, then use a loop 
> to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each 
> column, build them into a DataFrame of two rows, then use `unionAll` to merge 
> them together.
> Issue: Despite having the same column names, in the same order with the same 
> dtypes, the `unionAll` fails with "Task not serializable".  However, if I 
> build two test rows using dummy data then `unionAll` works fine.  Also, if I 
> collect my results then turn them straight back into DataFrames, `unionAll` 
> succeeds.  
> Step-by-step code and output with comments can be seen here: 
> https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb
> The issue appears to be in the way the loop in code block 6 is building the 
> rows before parallelizing, but the results look no different from the test 
> rows that do work.  I reproduced this on multiple datasets, so downloading 
> the notebook and pointing it to any data of your own should replicate it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16449) unionAll raises "Task not serializable"

2016-07-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368231#comment-15368231
 ] 

Dongjoon Hyun commented on SPARK-16449:
---

Unfortunately, I cannot test the notebook because the given S3 bucket is not public.
You can try it in Databricks Community Edition; 2.0.0 RC2, which includes 
SPARK-16173, is available there.

> unionAll raises "Task not serializable"
> ---
>
> Key: SPARK-16449
> URL: https://issues.apache.org/jira/browse/SPARK-16449
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
> Environment: AWS EMR, Jupyter notebook
>Reporter: Jeff Levy
>Priority: Minor
>
> Goal: Take the output from `describe` on a large DataFrame, then use a loop 
> to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each 
> column, build them into a DataFrame of two rows, then use `unionAll` to merge 
> them together.
> Issue: Despite having the same column names, in the same order with the same 
> dtypes, the `unionAll` fails with "Task not serializable".  However, if I 
> build two test rows using dummy data then `unionAll` works fine.  Also, if I 
> collect my results then turn them straight back into DataFrames, `unionAll` 
> succeeds.  
> Step-by-step code and output with comments can be seen here: 
> https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb
> The issue appears to be in the way the loop in code block 6 is building the 
> rows before parallelizing, but the results look no different from the test 
> rows that do work.  I reproduced this on multiple datasets, so downloading 
> the notebook and pointing it to any data of your own should replicate it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16449) unionAll raises "Task not serializable"

2016-07-08 Thread Jeff Levy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368226#comment-15368226
 ] 

Jeff Levy commented on SPARK-16449:
---

The problem doesn't appear to be with `describe` here.  If I leave the describe 
DataFrame unchanged but turn the data for merging into a list via collect and 
then back into a DataFrame, `unionAll` works.
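
A minimal sketch of that workaround, with hypothetical names (a 1.6 SQLContext 
called sqlContext, the unchanged describe() output in `described`, and the 
loop-built skewness/kurtosis DataFrame in `stats_df`); collecting materializes 
the lazily built rows locally, so the recreated DataFrame no longer drags the 
non-serializable iterator into the union:

{code}
# Hypothetical sketch of the collect-then-recreate workaround described above.
materialized = sqlContext.createDataFrame(stats_df.collect(), stats_df.schema)

# `described` is df.describe(), left untouched; the union now succeeds because
# `materialized` is backed by plain local rows rather than a lazy Stream/iterator.
combined = described.unionAll(materialized)
combined.show()
{code}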

> unionAll raises "Task not serializable"
> ---
>
> Key: SPARK-16449
> URL: https://issues.apache.org/jira/browse/SPARK-16449
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
> Environment: AWS EMR, Jupyter notebook
>Reporter: Jeff Levy
>Priority: Minor
>
> Goal: Take the output from `describe` on a large DataFrame, then use a loop 
> to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each 
> column, build them into a DataFrame of two rows, then use `unionAll` to merge 
> them together.
> Issue: Despite having the same column names, in the same order with the same 
> dtypes, the `unionAll` fails with "Task not serializable".  However, if I 
> build two test rows using dummy data then `unionAll` works fine.  Also, if I 
> collect my results then turn them straight back into DataFrames, `unionAll` 
> succeeds.  
> Step-by-step code and output with comments can be seen here: 
> https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb
> The issue appears to be in the way the loop in code block 6 is building the 
> rows before parallelizing, but the results look no different from the test 
> rows that do work.  I reproduced this on multiple datasets, so downloading 
> the notebook and pointing it to any data of your own should replicate it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3728) RandomForest: Learn models too large to store in memory

2016-07-08 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368216#comment-15368216
 ] 

Manoj Kumar commented on SPARK-3728:


Hi [~xusen]. Are you still working on this?

> RandomForest: Learn models too large to store in memory
> ---
>
> Key: SPARK-3728
> URL: https://issues.apache.org/jira/browse/SPARK-3728
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Proposal: Write trees to disk as they are learned.
> RandomForest currently uses a FIFO queue, which means training all trees at 
> once via breadth-first search.  Using a FILO queue would encourage the code 
> to finish one tree before moving on to new ones.  This would allow the code 
> to write trees to disk as they are learned.
> Note: It would also be possible to write nodes to disk as they are learned 
> using a FIFO queue, once the example--node mapping is cached [JIRA].  The 
> [Sequoia Forest package]() does this.  However, it could be useful to learn 
> trees progressively, so that future functionality such as early stopping 
> (training fewer trees than expected) could be supported.
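
For illustration only, a tiny plain-Python sketch (hypothetical, not Spark code) 
of the queue-discipline point above: with a LIFO queue, one tree's nodes are 
drained before the next tree's root is touched, which is what would let finished 
trees be written to disk early, while a FIFO queue interleaves all trees 
breadth-first.

{code}
from collections import deque

def run(discipline):
    # Seed one root task per tree; each root "splits" into two child tasks.
    work = deque(("tree-%d" % t, "root") for t in range(3))
    order = []
    while work:
        tree, node = work.pop() if discipline == "LIFO" else work.popleft()
        order.append((tree, node))
        if node == "root":
            work.extend([(tree, "left"), (tree, "right")])
    return order

print(run("FIFO"))  # roots of all trees first, then children: trees interleave
print(run("LIFO"))  # tree-2 finishes completely before tree-1, then tree-0
{code}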



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16449) unionAll raises "Task not serializable"

2016-07-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368204#comment-15368204
 ] 

Dongjoon Hyun edited comment on SPARK-16449 at 7/8/16 7:03 PM:
---

In that issue, the root cause was `describe` itself.
The fix was merged into both master and branch-1.6.


was (Author: dongjoon):
In that issue, the root cause was `describe` itself.

> unionAll raises "Task not serializable"
> ---
>
> Key: SPARK-16449
> URL: https://issues.apache.org/jira/browse/SPARK-16449
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
> Environment: AWS EMR, Jupyter notebook
>Reporter: Jeff Levy
>Priority: Minor
>
> Goal: Take the output from `describe` on a large DataFrame, then use a loop 
> to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each 
> column, build them into a DataFrame of two rows, then use `unionAll` to merge 
> them together.
> Issue: Despite having the same column names, in the same order with the same 
> dtypes, the `unionAll` fails with "Task not serializable".  However, if I 
> build two test rows using dummy data then `unionAll` works fine.  Also, if I 
> collect my results then turn them straight back into DataFrames, `unionAll` 
> succeeds.  
> Step-by-step code and output with comments can be seen here: 
> https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb
> The issue appears to be in the way the loop in code block 6 is building the 
> rows before parallelizing, but the results look no different from the test 
> rows that do work.  I reproduced this on multiple datasets, so downloading 
> the notebook and pointing it to any data of your own should replicate it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16449) unionAll raises "Task not serializable"

2016-07-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368204#comment-15368204
 ] 

Dongjoon Hyun commented on SPARK-16449:
---

In that issue, the root cause was `describe` itself.

> unionAll raises "Task not serializable"
> ---
>
> Key: SPARK-16449
> URL: https://issues.apache.org/jira/browse/SPARK-16449
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
> Environment: AWS EMR, Jupyter notebook
>Reporter: Jeff Levy
>Priority: Minor
>
> Goal: Take the output from `describe` on a large DataFrame, then use a loop 
> to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each 
> column, build them into a DataFrame of two rows, then use `unionAll` to merge 
> them together.
> Issue: Despite having the same column names, in the same order with the same 
> dtypes, the `unionAll` fails with "Task not serializable".  However, if I 
> build two test rows using dummy data then `unionAll` works fine.  Also, if I 
> collect my results then turn them straight back into DataFrames, `unionAll` 
> succeeds.  
> Step-by-step code and output with comments can be seen here: 
> https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb
> The issue appears to be in the way the loop in code block 6 is building the 
> rows before parallelizing, but the results look no different from the test 
> rows that do work.  I reproduced this on multiple datasets, so downloading 
> the notebook and pointing it to any data of your own should replicate it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16450) Build fails for Mesos 0.28.x

2016-07-08 Thread Niels Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368199#comment-15368199
 ] 

Niels Becker commented on SPARK-16450:
--

Spark 1.6.1 was working with Mesos 0.28.0

> Build fails for Mesos 0.28.x
> -
>
> Key: SPARK-16450
> URL: https://issues.apache.org/jira/browse/SPARK-16450
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.0.0
> Environment: Mesos 0.28.0
>Reporter: Niels Becker
>
> Build fails:
> [error] 
> /usr/local/spark/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:82:
>  type mismatch;
> [error]  found   : org.apache.mesos.protobuf.ByteString
> [error]  required: String
> [error]   credBuilder.setSecret(ByteString.copyFromUtf8(secret))
> Build cmd:
> dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive 
> -DskipTests -Dmesos.version=0.28.0 -Djava.version=1.8
> Spark Version: 2.0.0-rc2
> Java: OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-1~bpo8+1-b14
> Scala Version: 2.11.8
> Same error for mesos.version=0.28.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16449) unionAll raises "Task not serializable"

2016-07-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368201#comment-15368201
 ] 

Dongjoon Hyun commented on SPARK-16449:
---

Could you try this in master?
It seems to be similar to SPARK-16173.

> unionAll raises "Task not serializable"
> ---
>
> Key: SPARK-16449
> URL: https://issues.apache.org/jira/browse/SPARK-16449
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
> Environment: AWS EMR, Jupyter notebook
>Reporter: Jeff Levy
>Priority: Minor
>
> Goal: Take the output from `describe` on a large DataFrame, then use a loop 
> to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each 
> column, build them into a DataFrame of two rows, then use `unionAll` to merge 
> them together.
> Issue: Despite having the same column names, in the same order with the same 
> dtypes, the `unionAll` fails with "Task not serializable".  However, if I 
> build two test rows using dummy data then `unionAll` works fine.  Also, if I 
> collect my results then turn them straight back into DataFrames, `unionAll` 
> succeeds.  
> Step-by-step code and output with comments can be seen here: 
> https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb
> The issue appears to be in the way the loop in code block 6 is building the 
> rows before parallelizing, but the results look no different from the test 
> rows that do work.  I reproduced this on multiple datasets, so downloading 
> the notebook and pointing it to any data of your own should replicate it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13638) Support for saving with a quote mode

2016-07-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13638.
-
   Resolution: Fixed
 Assignee: Jurriaan Pruis
Fix Version/s: 2.0.1

> Support for saving with a quote mode
> 
>
> Key: SPARK-13638
> URL: https://issues.apache.org/jira/browse/SPARK-13638
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Jurriaan Pruis
>Priority: Minor
> Fix For: 2.0.1
>
>
> https://github.com/databricks/spark-csv/pull/254
> tobithiel reported this.
> {quote}
> I'm dealing with some messy csv files and being able to just quote all fields 
> is very useful,
> so that other applications don't misunderstand the file because of some 
> sketchy characters
> {quote}
> When writing there are several quote modes in apache commons csv. (See 
> https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html)
> This might have to be supported.
> However, it looks like the univocity parser used for writing (currently the only 
> library supported) does not support this quote mode. I think we can drop this 
> backwards compatibility if we are not going to add Apache Commons CSV.
> This is a reminder that it might break backwards compatibility for the 
> {{quoteMode}} option.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16447) LDA wrapper in SparkR

2016-07-08 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368149#comment-15368149
 ] 

Xusen Yin commented on SPARK-16447:
---

[~mengxr] I'd like to work on this.

> LDA wrapper in SparkR
> -
>
> Key: SPARK-16447
> URL: https://issues.apache.org/jira/browse/SPARK-16447
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, SparkR
>Reporter: Xiangrui Meng
>
> Follow instructions in SPARK-16442 and implement LDA wrapper in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16449) unionAll raises "Task not serializable"

2016-07-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368143#comment-15368143
 ] 

Sean Owen commented on SPARK-16449:
---

Interesting. Looks like something is serializing a Scala Iterator and it 
doesn't work. 

The relevant subset is below. Hm, maybe something in LocalTableScan can be 
rejiggered to avoid this.

{code}
Caused by: java.io.NotSerializableException: scala.collection.Iterator$$anon$11
Serialization stack:
- object not serializable (class: scala.collection.Iterator$$anon$11, 
value: empty iterator)
- field (class: scala.collection.Iterator$$anonfun$toStream$1, name: 
$outer, type: interface scala.collection.Iterator)
- object (class scala.collection.Iterator$$anonfun$toStream$1, 
)
- field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
interface scala.Function0)
- object (class scala.collection.immutable.Stream$Cons, 
Stream(WrappedArray(3526154, 3526154, 1580402, 3526154, 3526154), 
WrappedArray(5.50388599500189E11, 4.178168090221903, 234846.780654818, 
5.134865351881966, 354.7084951479714), WrappedArray(2.596112361975223E11, 
0.34382335723646484, 118170.68592261613, 3.3833930336063456, 
4.011812510792076), WrappedArray(12091588, 2.75, 0.85, -1, 292), 
WrappedArray(95696635, 6.125, 1193544.39, 34, 480)))
- field (class: scala.collection.immutable.Stream$$anonfun$zip$1, name: 
$outer, type: class scala.collection.immutable.Stream)
- object (class scala.collection.immutable.Stream$$anonfun$zip$1, 
)
- field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
interface scala.Function0)
- object (class scala.collection.immutable.Stream$Cons, 
Stream((WrappedArray(3526154, 3526154, 1580402, 3526154, 
3526154),(count,)), (WrappedArray(5.50388599500189E11, 
4.178168090221903, 234846.780654818, 5.134865351881966, 
354.7084951479714),(mean,)), (WrappedArray(2.596112361975223E11, 
0.34382335723646484, 118170.68592261613, 3.3833930336063456, 
4.011812510792076),(stddev,)), (WrappedArray(12091588, 2.75, 
0.85, -1, 292),(min,)), (WrappedArray(95696635, 6.125, 
1193544.39, 34, 480),(max,
- field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: 
$outer, type: class scala.collection.immutable.Stream)
- object (class scala.collection.immutable.Stream$$anonfun$map$1, 
)
- field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
interface scala.Function0)
- object (class scala.collection.immutable.Stream$Cons, 
Stream([count,3526154,3526154,1580402,3526154,3526154], 
[mean,5.50388599500189E11,4.178168090221903,234846.780654818,5.134865351881966,354.7084951479714],
 
[stddev,2.596112361975223E11,0.34382335723646484,118170.68592261613,3.3833930336063456,4.011812510792076],
 [min,12091588,2.75,0.85,-1,292], 
[max,95696635,6.125,1193544.39,34,480]))
- field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: 
$outer, type: class scala.collection.immutable.Stream)
- object (class scala.collection.immutable.Stream$$anonfun$map$1, 
)
- field (class: scala.collection.immutable.Stream$Cons, name: tl, type: 
interface scala.Function0)
- object (class scala.collection.immutable.Stream$Cons, 
Stream([count,3526154,3526154,1580402,3526154,3526154], 
[mean,5.50388599500189E11,4.178168090221903,234846.780654818,5.134865351881966,354.7084951479714],
 
[stddev,2.596112361975223E11,0.34382335723646484,118170.68592261613,3.3833930336063456,4.011812510792076],
 [min,12091588,2.75,0.85,-1,292], 
[max,95696635,6.125,1193544.39,34,480]))
- field (class: org.apache.spark.sql.execution.LocalTableScan, name: 
rows, type: interface scala.collection.Seq)
- object (class org.apache.spark.sql.execution.LocalTableScan, 
LocalTableScan [summary#228,C0#229,C3#230,C4#231,C5#232,C6#233], 
[[count,3526154,3526154,1580402,3526154,3526154],[mean,5.50388599500189E11,4.178168090221903,234846.780654818,5.134865351881966,354.7084951479714],[stddev,2.596112361975223E11,0.34382335723646484,118170.68592261613,3.3833930336063456,4.011812510792076],[min,12091588,2.75,0.85,-1,292],[max,95696635,6.125,1193544.39,34,480]]
)
- field (class: org.apache.spark.sql.execution.ConvertToUnsafe, name: 
child, type: class org.apache.spark.sql.execution.SparkPlan)
- object (class org.apache.spark.sql.execution.ConvertToUnsafe, 
ConvertToUnsafe
+- LocalTableScan [summary#228,C0#229,C3#230,C4#231,C5#232,C6#233], 
[[count,3526154,3526154,1580402,3526154,3526154],[mean,5.50388599500189E11,4.178168090221903,234846.780654818,5.134865351881966,354.7084951479714],[stddev,2.596112361975223E11,0.34382335723646484,118170.68592261613,3.3833930336063456,4.011812510792076],[min,12091588,2.75,0.85,-1,292],[max,95696635,6.125,1193544.39,34,480]]
)
- field (class: 

[jira] [Commented] (SPARK-16450) Build fails for Mesos 0.28.x

2016-07-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368138#comment-15368138
 ] 

Sean Owen commented on SPARK-16450:
---

I don't think this version of Mesos is supported; it sounds like the API 
changed in a mutually incompatible way.

> Build fails for Mesos 0.28.x
> -
>
> Key: SPARK-16450
> URL: https://issues.apache.org/jira/browse/SPARK-16450
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.0.0
> Environment: Mesos 0.28.0
>Reporter: Niels Becker
>
> Build fails:
> [error] 
> /usr/local/spark/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:82:
>  type mismatch;
> [error]  found   : org.apache.mesos.protobuf.ByteString
> [error]  required: String
> [error]   credBuilder.setSecret(ByteString.copyFromUtf8(secret))
> Build cmd:
> dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive 
> -DskipTests -Dmesos.version=0.28.0 -Djava.version=1.8
> Spark Version: 2.0.0-rc2
> Java: OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-1~bpo8+1-b14
> Scala Version: 2.11.8
> Same error for mesos.version=0.28.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16450) Build fails for Mesos 0.28.x

2016-07-08 Thread Niels Becker (JIRA)
Niels Becker created SPARK-16450:


 Summary: Build fails for Mesos 0.28.x
 Key: SPARK-16450
 URL: https://issues.apache.org/jira/browse/SPARK-16450
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.0.0
 Environment: Mesos 0.28.0
Reporter: Niels Becker


Build fails:
[error] 
/usr/local/spark/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:82:
 type mismatch;
[error]  found   : org.apache.mesos.protobuf.ByteString
[error]  required: String
[error]   credBuilder.setSecret(ByteString.copyFromUtf8(secret))

Build cmd:
dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive 
-DskipTests -Dmesos.version=0.28.0 -Djava.version=1.8

Spark Version: 2.0.0-rc2
Java: OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-1~bpo8+1-b14
Scala Version: 2.11.8

Same error for mesos.version=0.28.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16449) unionAll raises "Task not serializable"

2016-07-08 Thread Jeff Levy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Levy updated SPARK-16449:
--
Description: 
Goal: Take the output from `describe` on a large DataFrame, then use a loop to 
calculate `skewness` and `kurtosis` from pyspark.sql.functions for each column, 
build them into a DataFrame of two rows, then use `unionAll` to merge them 
together.

Issue: Despite having the same column names, in the same order with the same 
dtypes, the `unionAll` fails with "Task not serializable".  However, if I build 
two test rows using dummy data then `unionAll` works fine.  Also, if I collect 
my results then turn them straight back into DataFrames, `unionAll` succeeds.  


Step-by-step code and output with comments can be seen here: 
https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb
The issue appears to be in the way the loop in code block 6 is building the 
rows before parallelizing, but the results look no different from the test rows 
that do work.  I reproduced this on multiple datasets, so downloading the 
notebook and pointing it to any data of your own should replicate it.

  was:
Goal: Take the output from `describe` on a large DataFrame, then use a loop to 
calculate `skewness` and `kurtosis` from pyspark.sql.functions for each column, 
build them into a DataFrame of two rows, then use `unionAll` to merge them 
together.

Issue: Despite having the same column names, in the same order with the same 
dtypes, the `unionAll` fails with "Task not serializable".  However, if I build 
two test rows using dummy data then `unionAll` works fine.  Also, if I collect 
my results then turn them straight back into DataFrames, `unionAll` succeeds.  


Step-by-step code and output with comments can be seen here: 
https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb
The issue appears to be in the way the loop in code block 6 is building the 
rows before parallelizing, but the results look no different from the test rows 
that do work.


> unionAll raises "Task not serializable"
> ---
>
> Key: SPARK-16449
> URL: https://issues.apache.org/jira/browse/SPARK-16449
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
> Environment: AWS EMR, Jupyter notebook
>Reporter: Jeff Levy
>Priority: Minor
>
> Goal: Take the output from `describe` on a large DataFrame, then use a loop 
> to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each 
> column, build them into a DataFrame of two rows, then use `unionAll` to merge 
> them together.
> Issue: Despite having the same column names, in the same order with the same 
> dtypes, the `unionAll` fails with "Task not serializable".  However, if I 
> build two test rows using dummy data then `unionAll` works fine.  Also, if I 
> collect my results then turn them straight back into DataFrames, `unionAll` 
> succeeds.  
> Step-by-step code and output with comments can be seen here: 
> https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb
> The issue appears to be in the way the loop in code block 6 is building the 
> rows before parallelizing, but the results look no different from the test 
> rows that do work.  I reproduced this on multiple datasets, so downloading 
> the notebook and pointing it to any data of your own should replicate it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16409) regexp_extract with optional groups causes NPE

2016-07-08 Thread Max Moroz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Moroz updated SPARK-16409:
--
Description: 
df = sqlContext.createDataFrame([['c']], ['s'])
df.select(F.regexp_extract('s', r'(a+)(b)?(c)', 2)).collect()

causes NPE. Worse, in a large program it doesn't cause NPE instantly; it 
actually works fine, until some unpredictable (and inconsistent) moment in the 
future when (presumably) the invalid memory access occurs, and then it fails. 
For this reason, it took several hours to debug this.

Suggestion: either fill the group with null; or raise exception immediately 
after examining the argument with a message that optional groups are not 
allowed.

Traceback:

---
Py4JJavaError Traceback (most recent call last)
 in ()
> 1 df.select(F.regexp_extract('s', r'(a+)(b)?(c)', 2)).collect()

C:\Users\me\Downloads\spark-2.0.0-preview-bin-hadoop2.7\python\pyspark\sql\dataframe.py
 in collect(self)
294 """
295 with SCCallSiteSync(self._sc) as css:
--> 296 port = self._jdf.collectToPython()
297 return list(_load_from_socket(port, 
BatchedSerializer(PickleSerializer(
298 

C:\Users\me\Downloads\spark-2.0.0-preview-bin-hadoop2.7\python\lib\py4j-0.10.1-src.zip\py4j\java_gateway.py
 in __call__(self, *args)
931 answer = self.gateway_client.send_command(command)
932 return_value = get_return_value(
--> 933 answer, self.gateway_client, self.target_id, self.name)
934 
935 for temp_arg in temp_args:

C:\Users\me\Downloads\spark-2.0.0-preview-bin-hadoop2.7\python\pyspark\sql\utils.py
 in deco(*a, **kw)
 55 def deco(*a, **kw):
 56 try:
---> 57 return f(*a, **kw)
 58 except py4j.protocol.Py4JJavaError as e:
 59 s = e.java_exception.toString()

C:\Users\me\Downloads\spark-2.0.0-preview-bin-hadoop2.7\python\lib\py4j-0.10.1-src.zip\py4j\protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
310 raise Py4JJavaError(
311 "An error occurred while calling {0}{1}{2}.\n".
--> 312 format(target_id, ".", name), value)
313 else:
314 raise Py4JError(

Py4JJavaError: An error occurred while calling o51.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0, localhost): java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:357)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:112)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:112)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:112)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:883)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:883)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1889)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1889)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

[jira] [Comment Edited] (SPARK-16439) Incorrect information in SQL Query details

2016-07-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368056#comment-15368056
 ] 

Dongjoon Hyun edited comment on SPARK-16439 at 7/8/16 5:50 PM:
---

For me, it seems to work.

https://app.box.com/s/lw0kl1ft7z4od9fwamtoafipzfrnex8m

Could you provide more information?


was (Author: dongjoon):
For me, it works.

https://app.box.com/representation/file_version_77719788413/image_2048/1.png

Could you provide more information?

> Incorrect information in SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
> Attachments: spark.jpg
>
>
> One picture is worth a thousand words.
> Please see attachment
> Incorrect values are in fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16409) regexp_extract with optional groups causes NPE

2016-07-08 Thread Max Moroz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368059#comment-15368059
 ] 

Max Moroz commented on SPARK-16409:
---

[~srowen] So sorry, I was sure I had copied the entire code. I'm going to update 
the issue with the full details.


> regexp_extract with optional groups causes NPE
> --
>
> Key: SPARK-16409
> URL: https://issues.apache.org/jira/browse/SPARK-16409
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Max Moroz
>
> df.select(F.regexp_extract('s', r'(a+)(b)?(c)', 2)).collect()
> causes NPE. Worse, in a large program it doesn't cause NPE instantly; it 
> actually works fine, until some unpredictable (and inconsistent) moment in 
> the future when (presumably) the invalid memory access occurs, and then it 
> fails. For this reason, it took several hours to debug this.
> Suggestion: either fill the group with null; or raise exception immediately 
> after examining the argument with a message that optional groups are not 
> allowed.
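
A hedged workaround sketch for the toy example above (assuming the same sqlContext 
and one-row DataFrame from the description): writing the group as (b?) instead of 
(b)? keeps the group participating in every match, so it yields an empty string 
instead of the null that triggers the NPE; (a+) is also relaxed to (a*) so the 
pattern matches the sample input at all.

{code}
from pyspark.sql import functions as F

df = sqlContext.createDataFrame([['c']], ['s'])
# (b?) always participates in the match, so group 2 is '' rather than null.
df.select(F.regexp_extract('s', r'(a*)(b?)(c)', 2)).collect()
{code}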



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16439) Incorrect information in SQL Query details

2016-07-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368056#comment-15368056
 ] 

Dongjoon Hyun commented on SPARK-16439:
---

For me, it works.

https://app.box.com/representation/file_version_77719788413/image_2048/1.png

Could you provide more information?

> Incorrect information in SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
> Attachments: spark.jpg
>
>
> One picture is worth a thousand words.
> Please see attachment
> Incorrect values are in fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16449) unionAll raises "Task not serializable"

2016-07-08 Thread Jeff Levy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Levy updated SPARK-16449:
--
Description: 
Goal: Take the output from `describe` on a large DataFrame, then use a loop to 
calculate `skewness` and `kurtosis` from pyspark.sql.functions for each column, 
build them into a DataFrame of two rows, then use `unionAll` to merge them 
together.

Issue: Despite having the same column names, in the same order with the same 
dtypes, the `unionAll` fails with "Task not serializable".  However, if I build 
two test rows using dummy data then `unionAll` works fine.  Also, if I collect 
my results then turn them straight back into DataFrames, `unionAll` succeeds.  


Step-by-step code and output with comments can be seen here: 
https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb
The issue appears to be in the way the loop in code block 6 is building the 
rows before parallelizing, but the results look no different from the test rows 
that do work.

  was:
Goal: Take the output from `describe` on a large DataFrame, then use a loop to 
calculate `skewness` and `kurtosis` from pyspark.sql.functions for each column, 
build them into a DataFrame of two rows, then use `unionAll` to merge them 
together.

Issue: Despite having the same column names, in the same order with the same 
dtypes, the `unionAll` fails with "Task not serializable".  However, if I build 
two test rows using dummy data then `unionAll` works fine.  Also, if I collect 
my results then turn them straight back into DataFrames, `unionAll` succeeds.  


Step-by-step code and output with comments can be seen here: 
https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb


> unionAll raises "Task not serializable"
> ---
>
> Key: SPARK-16449
> URL: https://issues.apache.org/jira/browse/SPARK-16449
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
> Environment: AWS EMR, Jupyter notebook
>Reporter: Jeff Levy
>Priority: Minor
>
> Goal: Take the output from `describe` on a large DataFrame, then use a loop 
> to calculate `skewness` and `kurtosis` from pyspark.sql.functions for each 
> column, build them into a DataFrame of two rows, then use `unionAll` to merge 
> them together.
> Issue: Despite having the same column names, in the same order with the same 
> dtypes, the `unionAll` fails with "Task not serializable".  However, if I 
> build two test rows using dummy data then `unionAll` works fine.  Also, if I 
> collect my results then turn them straight back into DataFrames, `unionAll` 
> succeeds.  
> Step-by-step code and output with comments can be seen here: 
> https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb
> The issue appears to be in the way the loop in code block 6 is building the 
> rows before parallelizing, but the results look no different from the test 
> rows that do work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16444) Isotonic Regression wrapper in SparkR

2016-07-08 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368044#comment-15368044
 ] 

Miao Wang commented on SPARK-16444:
---

[~mengxr] Sure! I have done enough documentation and examples over the last couple 
of weeks. Glad to work on something new. Thanks!

> Isotonic Regression wrapper in SparkR
> -
>
> Key: SPARK-16444
> URL: https://issues.apache.org/jira/browse/SPARK-16444
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, SparkR
>Reporter: Xiangrui Meng
>
> Implement Isotonic Regression wrapper and other utils in SparkR.
> {code}
> spark.isotonicRegression(data, formula, ...)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16449) unionAll raises "Task not serializable"

2016-07-08 Thread Jeff Levy (JIRA)
Jeff Levy created SPARK-16449:
-

 Summary: unionAll raises "Task not serializable"
 Key: SPARK-16449
 URL: https://issues.apache.org/jira/browse/SPARK-16449
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.1
 Environment: AWS EMR, Jupyter notebook
Reporter: Jeff Levy
Priority: Minor


Goal: Take the output from `describe` on a large DataFrame, then use a loop to 
calculate `skewness` and `kurtosis` from pyspark.sql.functions for each column, 
build them into a DataFrame of two rows, then use `unionAll` to merge them 
together.

Issue: Despite having the same column names, in the same order with the same 
dtypes, the `unionAll` fails with "Task not serializable".  However, if I build 
two test rows using dummy data then `unionAll` works fine.  Also, if I collect 
my results then turn them straight back into DataFrames, `unionAll` succeeds.  


Step-by-step code and output with comments can be seen here: 
https://github.com/UrbanInstitute/pyspark-tutorials/blob/master/unionAll%20error.ipynb
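
For readers without access to the notebook, a minimal sketch of the intended 
computation (hypothetical names; assumes a SQLContext called sqlContext and an 
all-numeric DataFrame df; the notebook builds the extra rows inside a loop before 
parallelizing, which is where the failure shows up):

{code}
from pyspark.sql import functions as F, Row

described = df.describe()   # count/mean/stddev/min/max, one string column per input column

# Build one row per extra statistic, matching describe()'s column layout.
skew_row = Row(*(['skewness'] + [str(df.select(F.skewness(c)).first()[0]) for c in df.columns]))
kurt_row = Row(*(['kurtosis'] + [str(df.select(F.kurtosis(c)).first()[0]) for c in df.columns]))

extra = sqlContext.createDataFrame([skew_row, kurt_row], described.schema)
combined = described.unionAll(extra)   # the step reported to fail in 1.6.1 when rows are built lazily
combined.show()
{code}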



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15767) Decision Tree Regression wrapper in SparkR

2016-07-08 Thread Kai Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15367997#comment-15367997
 ] 

Kai Jiang commented on SPARK-15767:
---

[~mengxr] Would you mind giving some comments on how to design this API?

> Decision Tree Regression wrapper in SparkR
> --
>
> Key: SPARK-15767
> URL: https://issues.apache.org/jira/browse/SPARK-15767
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Kai Jiang
>Assignee: Kai Jiang
>
> Implement a wrapper in SparkR to support decision tree regression. R's native 
> Decision Tree Regression implementation comes from the rpart package, with the 
> signature rpart(formula, dataframe, method="anova"). I propose we implement an 
> API like spark.rpart(dataframe, formula, ...).  After having implemented 
> decision tree classification, we could refactor these two into an API more 
> like rpart().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15779) SQL context fails when Hive uses Tez as its default execution engine

2016-07-08 Thread Jonathan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15367977#comment-15367977
 ] 

Jonathan Kelly commented on SPARK-15779:


Is there a well-defined list of properties we should include in Spark's copy of 
hive-site.xml?

> SQL context fails when Hive uses Tez as its default execution engine
> 
>
> Key: SPARK-15779
> URL: https://issues.apache.org/jira/browse/SPARK-15779
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, SQL
>Affects Versions: 1.6.1
> Environment: Hadoop 2.7.2, Spark 1.6.1, Hive 2.0.1, Tez 0.8.3
>Reporter: Alexandre Linte
>
> By default, Hive uses MapReduce as its default execution engine. Since Hive 
> 2.0.0, MapReduce is deprecated.
> To avoid this deprecation, I decided to use Tez instead of MapReduce as the 
> default execution engine. Unfortunately, this choice had an impact on Spark.
> Now when I start Spark the SQL context fails with the following error:
> {noformat}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
>       /_/
> Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.7.0_85)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context available as sc.
> java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning
> at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:529)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:204)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
> at 
> org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:440)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:271)
> at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
> at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
> at $iwC$$iwC.(:15)
> at $iwC.(:24)
> at (:26)
> at .(:30)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)
> at 
> org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
> at 
> org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:124)
> at 
> 

[jira] [Commented] (SPARK-15804) Manually added metadata not saving with parquet

2016-07-08 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15367974#comment-15367974
 ] 

Wenchen Fan commented on SPARK-15804:
-

this will be fixed by https://github.com/apache/spark/pull/14106

> Manually added metadata not saving with parquet
> ---
>
> Key: SPARK-15804
> URL: https://issues.apache.org/jira/browse/SPARK-15804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Charlie Evans
>Assignee: kevin yu
>
> Adding metadata with col().as(_, metadata) and then saving the resulting 
> dataframe does not save the metadata. No error is thrown. The schema contains 
> the metadata before saving but not after saving and reloading the dataframe. 
> This was working fine with 1.6.1.
> {code}
> case class TestRow(a: String, b: Int)
> val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil
> val df = spark.createDataFrame(rows)
> import org.apache.spark.sql.types.MetadataBuilder
> val md = new MetadataBuilder().putString("key", "value").build()
> val dfWithMeta = df.select(col("a"), col("b").as("b", md))
> println(dfWithMeta.schema.json)
> dfWithMeta.write.parquet("dfWithMeta")
> val dfWithMeta2 = spark.read.parquet("dfWithMeta")
> println(dfWithMeta2.schema.json)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16448) RemoveAliasOnlyProject should not remove alias with metadata

2016-07-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15367972#comment-15367972
 ] 

Apache Spark commented on SPARK-16448:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/14106

> RemoveAliasOnlyProject should not remove alias with metadata
> 
>
> Key: SPARK-16448
> URL: https://issues.apache.org/jira/browse/SPARK-16448
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16448) RemoveAliasOnlyProject should not remove alias with metadata

2016-07-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16448:


Assignee: Apache Spark  (was: Wenchen Fan)

> RemoveAliasOnlyProject should not remove alias with metadata
> 
>
> Key: SPARK-16448
> URL: https://issues.apache.org/jira/browse/SPARK-16448
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16448) RemoveAliasOnlyProject should not remove alias with metadata

2016-07-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16448:


Assignee: Wenchen Fan  (was: Apache Spark)

> RemoveAliasOnlyProject should not remove alias with metadata
> 
>
> Key: SPARK-16448
> URL: https://issues.apache.org/jira/browse/SPARK-16448
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16448) RemoveAliasOnlyProject should not remove alias with metadata

2016-07-08 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-16448:
---

 Summary: RemoveAliasOnlyProject should not remove alias with 
metadata
 Key: SPARK-16448
 URL: https://issues.apache.org/jira/browse/SPARK-16448
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16446) Gaussian Mixture Model wrapper in SparkR

2016-07-08 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15367937#comment-15367937
 ] 

Xiangrui Meng commented on SPARK-16446:
---

[~yanboliang] Do you have time to work on this?

> Gaussian Mixture Model wrapper in SparkR
> 
>
> Key: SPARK-16446
> URL: https://issues.apache.org/jira/browse/SPARK-16446
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, SparkR
>Reporter: Xiangrui Meng
>
> Follow instructions in SPARK-16442 and implement Gaussian Mixture Model 
> wrapper in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16446) Gaussian Mixture Model wrapper in SparkR

2016-07-08 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-16446:
-

 Summary: Gaussian Mixture Model wrapper in SparkR
 Key: SPARK-16446
 URL: https://issues.apache.org/jira/browse/SPARK-16446
 Project: Spark
  Issue Type: Sub-task
Reporter: Xiangrui Meng


Follow instructions in SPARK-16442 and implement Gaussian Mixture Model wrapper 
in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16447) LDA wrapper in SparkR

2016-07-08 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-16447:
-

 Summary: LDA wrapper in SparkR
 Key: SPARK-16447
 URL: https://issues.apache.org/jira/browse/SPARK-16447
 Project: Spark
  Issue Type: Sub-task
Reporter: Xiangrui Meng


Follow instructions in SPARK-16442 and implement LDA wrapper in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


