[jira] [Created] (SPARK-19333) Files out of compliance with ASF policy

2017-01-22 Thread John D. Ament (JIRA)
John D. Ament created SPARK-19333:
-

 Summary: Files out of compliance with ASF policy
 Key: SPARK-19333
 URL: https://issues.apache.org/jira/browse/SPARK-19333
 Project: Spark
  Issue Type: Bug
Reporter: John D. Ament


ASF policy is that source files include our headers

http://www.apache.org/legal/release-policy.html#license-headers

However, there are a few files in Spark's release that are missing headers. 
This list is not exhaustive:

https://github.com/apache/spark/blob/master/R/pkg/DESCRIPTION
https://github.com/apache/spark/blob/master/R/pkg/NAMESPACE







[jira] [Commented] (SPARK-19289) UnCache Dataset using Name

2017-01-22 Thread Kaushal Prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833983#comment-15833983
 ] 

Kaushal Prajapati commented on SPARK-19289:
---

[~srowen], yes, you are right, names are not necessarily unique. Still, if we were able 
to list all the Datasets, the same way we list all the RDDs with 
*sc.getPersistentRDDs*, it would be useful: if I maintain unique names for Datasets in 
an application, I can easily uncache a Dataset by name, and if I have a group of 
Datasets sharing the same name, I can uncache all of them with this feature.
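For reference, a rough workaround sketch with the current API (assuming a SparkSession named {{spark}} and that each Dataset is registered as a temp view under the name you want to track); the catalog can then list and uncache by name:

{code}
// Register and cache under an application-chosen name.
df.createOrReplaceTempView("MyDataset")
spark.catalog.cacheTable("MyDataset")

// List everything currently cached, by name.
val cached = spark.catalog.listTables().collect()
  .filter(t => spark.catalog.isCached(t.name))
cached.foreach(t => println(t.name))

// Uncache by name.
spark.catalog.uncacheTable("MyDataset")
{code}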

> UnCache Dataset using Name
> --
>
> Key: SPARK-19289
> URL: https://issues.apache.org/jira/browse/SPARK-19289
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Kaushal Prajapati
>Priority: Minor
>  Labels: features
>
> We can cache and uncache any table using its name in Spark SQL.
> {code}
> df.createTempView("myTable")
> sqlContext.cacheTable("myTable")
> sqlContext.uncacheTable("myTable")
> {code}
> Likewise, it would be very useful to have some kind of uniqueness for names of 
> Datasets and an abstraction similar to the one we have for tables, for 
> example:
> {code}
> scala> val df = sc.range(1,1000).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> df.setName("MyDataset")
> res0: df.type = MyDataset
> scala> df.cache
> res1: df.type = MyDataset
> sqlContext.getDataSet("MyDataset")
> sqlContext.uncacheDataSet("MyDataset")
> {code}






[jira] [Commented] (SPARK-19282) RandomForestRegressionModel should expose getMaxDepth

2017-01-22 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833973#comment-15833973
 ] 

Xin Ren commented on SPARK-19282:
-

Sorry Nick, I cannot make it for this fix now.

Could anyone else please take a look? Thanks a lot.

> RandomForestRegressionModel should expose getMaxDepth
> -
>
> Key: SPARK-19282
> URL: https://issues.apache.org/jira/browse/SPARK-19282
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Nick Lothian
>Priority: Minor
>
> Currently it isn't clear how to get the max depth of a 
> RandomForestRegressionModel (e.g., after doing a grid search).
> It is possible to call
> {{regressor._java_obj.getMaxDepth()}} 
> but most other decision trees allow
> {{regressor.getMaxDepth()}} 






[jira] [Commented] (SPARK-2306) BoundedPriorityQueue is private and not registered with Kryo

2017-01-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833928#comment-15833928
 ] 

Apache Spark commented on SPARK-2306:
-

User 'bhardwajank' has created a pull request for this issue:
https://github.com/apache/spark/pull/1299

> BoundedPriorityQueue is private and not registered with Kryo
> 
>
> Key: SPARK-2306
> URL: https://issues.apache.org/jira/browse/SPARK-2306
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Daniel Darabos
>Assignee: ankit bhardwaj
> Fix For: 1.1.0
>
>
> Because BoundedPriorityQueue is private and not registered with Kryo, RDD.top 
> cannot be used when using Kryo (the recommended configuration).
> Curiously BoundedPriorityQueue is registered by GraphKryoRegistrator. But 
> that's the wrong registrator. (Is there one for Spark Core?)
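For reference, a user-side registration sketch (the class name MyRegistrator is illustrative; the class is registered by its string name since it is private[spark], and this assumes Kryo is the configured serializer):

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Register the private[spark] class by name from user code.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(Class.forName("org.apache.spark.util.BoundedPriorityQueue"))
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
{code}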






[jira] [Resolved] (SPARK-19309) disable common subexpression elimination for conditional expressions

2017-01-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19309.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16659
[https://github.com/apache/spark/pull/16659]

> disable common subexpression elimination for conditional expressions
> 
>
> Key: SPARK-19309
> URL: https://issues.apache.org/jira/browse/SPARK-19309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>







[jira] [Comment Edited] (SPARK-19328) using spark thrift server gets memory leak problem in `ExecutorsListener`

2017-01-22 Thread cen yuhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833925#comment-15833925
 ] 

cen yuhai edited comment on SPARK-19328 at 1/23/17 5:27 AM:


I think this is fixed by SPARK-17406; should we backport it to Spark 2.0.x?


was (Author: cenyuhai):
I think this is fixed by SPARK-17406

> using spark thrift server gets memory leak problem in `ExecutorsListener`
> -
>
> Key: SPARK-19328
> URL: https://issues.apache.org/jira/browse/SPARK-19328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: spark2.0.2
>Reporter: roncenzhao
> Attachments: screenshot-1.png
>
>
> When I use the Spark Thrift Server, the memory usage gradually increases. 
> I found that in `ExecutorsListener` the data about executors is never 
> trimmed.
> So do we need some mechanism to trim the data?






[jira] [Commented] (SPARK-19328) using spark thrift server gets memory leak problem in `ExecutorsListener`

2017-01-22 Thread cen yuhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833925#comment-15833925
 ] 

cen yuhai commented on SPARK-19328:
---

I think this is fixed by SPARK-17406

> using spark thrift server gets memory leak problem in `ExecutorsListener`
> -
>
> Key: SPARK-19328
> URL: https://issues.apache.org/jira/browse/SPARK-19328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: spark2.0.2
>Reporter: roncenzhao
> Attachments: screenshot-1.png
>
>
> When I use the Spark Thrift Server, the memory usage gradually increases. 
> I found that in `ExecutorsListener` the data about executors is never 
> trimmed.
> So do we need some mechanism to trim the data?






[jira] [Resolved] (SPARK-19229) Disallow Creating Hive Source Tables when Hive Support is Not Enabled

2017-01-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19229.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16587
[https://github.com/apache/spark/pull/16587]

> Disallow Creating Hive Source Tables when Hive Support is Not Enabled
> -
>
> Key: SPARK-19229
> URL: https://issues.apache.org/jira/browse/SPARK-19229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> It is weird to create Hive source tables when using InMemoryCatalog, since we are 
> unable to operate on them. We should block users from creating Hive source tables.
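A sketch of the scenario for illustration (assuming a session built without Hive support):

{code}
// No .enableHiveSupport(), so the session uses InMemoryCatalog.
val spark = org.apache.spark.sql.SparkSession.builder()
  .master("local[*]")
  .getOrCreate()

// This creates a Hive-serde table; with the fix it should be rejected up front
// instead of producing a table we cannot actually operate on.
spark.sql("CREATE TABLE t1 (id INT) STORED AS PARQUET")
{code}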






[jira] [Commented] (SPARK-19109) ORC metadata section can sometimes exceed protobuf message size limit

2017-01-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833870#comment-15833870
 ] 

Hyukjin Kwon commented on SPARK-19109:
--

It seems this JIRA involves upgrading the Hive dependency, which is 
currently 1.2.1 to my knowledge. 
The stacktrace seems related to {{OrcFileOperator.getFileReader}} on the Spark 
side, which uses {{OrcFile.createReader}} to infer the schema.
Let me leave a link related to this.
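For reference, a minimal sketch of that code path against the bundled Hive 1.2.x API (the file path is a placeholder); parsing the ORC footer/metadata here is where the 64MB protobuf limit can be hit:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.io.orc.OrcFile

val conf = new Configuration()
// OrcFileOperator.getFileReader ends up here; the ORC metadata is decoded with protobuf.
val reader = OrcFile.createReader(new Path("/path/to/file.orc"), OrcFile.readerOptions(conf))
val inspector = reader.getObjectInspector  // used to infer the Spark schema
{code}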

> ORC metadata section can sometimes exceed protobuf message size limit
> -
>
> Key: SPARK-19109
> URL: https://issues.apache.org/jira/browse/SPARK-19109
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Nic Eggert
>
> Basically, Spark inherits HIVE-11592 from its Hive dependency. From that 
> issue:
> If there are too many small stripes and many columns, the overhead for 
> storing metadata (column stats) can exceed the default protobuf message size 
> of 64MB. Reading such files will throw the following exception:
> {code}
> Exception in thread "main" 
> com.google.protobuf.InvalidProtocolBufferException: Protocol message was too 
> large.  May be malicious.  Use CodedInputStream.setSizeLimit() to increase 
> the size limit.
> at 
> com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:110)
> at 
> com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:755)
> at 
> com.google.protobuf.CodedInputStream.readRawBytes(CodedInputStream.java:811)
> at 
> com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:329)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1331)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1281)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1374)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1369)
> at 
> com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:4887)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:4803)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4990)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4985)
> at 
> com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.<init>(OrcProto.java:12925)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.<init>(OrcProto.java:12872)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12961)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12956)
> at 
> com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.<init>(OrcProto.java:13599)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.<init>(OrcProto.java:13546)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13635)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13630)
> at 
> com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
> at 
> com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
> at 
> com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
> at 
> com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.parseFrom(OrcProto.java:13746)
> at 
> org.apache.hadoop.hive.ql.io.orc.ReaderImpl$MetaInfoObjExtractor.<init>(ReaderImpl.java:468)
> at 
> org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:314)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:228)
> at org.apache.hadoop.hive.ql.io.orc.FileDump.main(FileDump.java:67)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 

[jira] [Commented] (SPARK-19155) MLlib GeneralizedLinearRegression family and link should case insensitive

2017-01-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833850#comment-15833850
 ] 

Apache Spark commented on SPARK-19155:
--

User 'actuaryzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/16675

> MLlib GeneralizedLinearRegression family and link should case insensitive 
> --
>
> Key: SPARK-19155
> URL: https://issues.apache.org/jira/browse/SPARK-19155
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> ML {{GeneralizedLinearRegression}} currently only supports lowercase input for 
> {{family}} and {{link}}. For example, the following code will throw an 
> exception:
> {code}
> val trainer = new GeneralizedLinearRegression().setFamily("Gamma")
> {code}
> However, R's glm accepts the families {{gaussian, binomial, poisson, Gamma}}. 
> We should make {{family}} and {{link}} case insensitive. 
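Until then, a workaround sketch is simply to pass the names in lowercase (the values shown are illustrative):

{code}
import org.apache.spark.ml.regression.GeneralizedLinearRegression

val trainer = new GeneralizedLinearRegression()
  .setFamily("gamma")   // "Gamma" throws before the fix
  .setLink("log")
{code}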






[jira] [Commented] (SPARK-19328) using spark thrift server gets memory leak problem in `ExecutorsListener`

2017-01-22 Thread roncenzhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833802#comment-15833802
 ] 

roncenzhao commented on SPARK-19328:


[~srowen] The heap dump shows that `executorToLogUrls` and 
`executorIdToData` hold the majority of the memory.
So I think we should add some strategy to trim the stale data.
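Purely as an illustration of the trimming idea (not the actual fix; the names and the retention limit below are hypothetical), a bounded map that evicts the oldest entries might look like this:

{code}
import scala.collection.mutable

// Hypothetical retention limit and map, for illustration only.
val retainedDeadExecutors = 100
val executorIdToData = mutable.LinkedHashMap[String, AnyRef]()

def trim(): Unit = {
  while (executorIdToData.size > retainedDeadExecutors) {
    // LinkedHashMap preserves insertion order, so head is the oldest entry.
    executorIdToData -= executorIdToData.head._1
  }
}
{code}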

> using spark thrift server gets memory leak problem in `ExecutorsListener`
> -
>
> Key: SPARK-19328
> URL: https://issues.apache.org/jira/browse/SPARK-19328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: spark2.0.2
>Reporter: roncenzhao
> Attachments: screenshot-1.png
>
>
> When I use the Spark Thrift Server, the memory usage gradually increases. 
> I found that in `ExecutorsListener` the data about executors is never 
> trimmed.
> So do we need some mechanism to trim the data?






[jira] [Commented] (SPARK-19076) Upgrade Hive dependence to Hive 2.x

2017-01-22 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833801#comment-15833801
 ] 

Dongjoon Hyun commented on SPARK-19076:
---

Hi, [~dapengsun].
It seems to be a subset of SPARK-13446 because this is a way to achieve 
SPARK-13446.
Is it different?

> Upgrade Hive dependence to Hive 2.x
> ---
>
> Key: SPARK-19076
> URL: https://issues.apache.org/jira/browse/SPARK-19076
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dapeng Sun
>
> Currently upstream Spark depends on Hive 1.2.1 to build its package. Hive 
> 2.0 was released in February 2016, and Hive 2.0.1 and 2.1.0 have also been out 
> for a long time, so on the Spark side it would be better to support Hive 2.0 and above.






[jira] [Updated] (SPARK-19328) using spark thrift server gets memory leak problem in `ExecutorsListener`

2017-01-22 Thread roncenzhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

roncenzhao updated SPARK-19328:
---
Attachment: screenshot-1.png

> using spark thrift server gets memory leak problem in `ExecutorsListener`
> -
>
> Key: SPARK-19328
> URL: https://issues.apache.org/jira/browse/SPARK-19328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: spark2.0.2
>Reporter: roncenzhao
> Attachments: screenshot-1.png
>
>
> When I use the Spark Thrift Server, the memory usage gradually increases. 
> I found that in `ExecutorsListener` the data about executors is never 
> trimmed.
> So do we need some mechanism to trim the data?






[jira] [Updated] (SPARK-19322) Allow customizable column name in ScalaReflection.schemaFor

2017-01-22 Thread Wei Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Liu updated SPARK-19322:

Remaining Estimate: 24h
 Original Estimate: 24h

> Allow customizable column name in ScalaReflection.schemaFor
> ---
>
> Key: SPARK-19322
> URL: https://issues.apache.org/jira/browse/SPARK-19322
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Wei Liu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When creating a StructType using ScalaReflection.schemaFor, the column names 
> are defined by the Scala case class parameter names. This causes minor 
> inconvenience when reading/writing data between Python and Scala, which follow 
> different naming conventions. Ideally, we should allow customizing column names 
> using Scala annotations, in a similar fashion to Jackson serialization 
> libraries.
> For example, we could define the schema as:
> case class MySchema(@fieldName("long_field") longField: Long, 
> @fieldName("int_field") intField: Int)
> The field/column names can now be customized.
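A minimal sketch of what that could look like, plus a workaround that is possible today (the {{@fieldName}} annotation is hypothetical and does not exist in Spark; an active SparkSession named {{spark}} is assumed):

{code}
import scala.annotation.StaticAnnotation

// Hypothetical annotation from the proposal above.
class fieldName(name: String) extends StaticAnnotation

case class MySchema(longField: Long, intField: Int)

// Workaround today: derive the schema from the case class, then rename columns explicitly.
import spark.implicits._
val df = Seq(MySchema(1L, 2)).toDS().toDF("long_field", "int_field")
df.printSchema()  // long_field: bigint, int_field: int
{code}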






[jira] [Updated] (SPARK-19322) Allow customizable column name in ScalaReflection.schemaFor

2017-01-22 Thread Wei Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Liu updated SPARK-19322:

Description: 
When creating a StructType using ScalaReflection.schemaFor, the column names are 
defined by the Scala case class parameter names. This causes minor inconvenience 
when reading/writing data between Python and Scala, which follow different 
naming conventions. Ideally, we should allow customizing column names using Scala 
annotations, in a similar fashion to Jackson serialization libraries.

For example, we could define the schema as:

case class MySchema(@fieldName("long_field") longField: Long, 
@fieldName("int_field") intField: Int)

The field/column names can now be customized.

  was:When creating StructType using Scalarelfection.schemaFor, the column 
names are defined by scala case class parameter names. This causes minor 
inconvenience when reading/writing data between Python and Scala. Ideally, we 
should allow customize column names using Scala annotations in a similar 
fashion to Jackson serialization libraries.


> Allow customizable column name in ScalaReflection.schemaFor
> ---
>
> Key: SPARK-19322
> URL: https://issues.apache.org/jira/browse/SPARK-19322
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Wei Liu
>Priority: Minor
>
> When creating a StructType using ScalaReflection.schemaFor, the column names 
> are defined by the Scala case class parameter names. This causes minor 
> inconvenience when reading/writing data between Python and Scala, which follow 
> different naming conventions. Ideally, we should allow customizing column names 
> using Scala annotations, in a similar fashion to Jackson serialization 
> libraries.
> For example, we could define the schema as:
> case class MySchema(@fieldName("long_field") longField: Long, 
> @fieldName("int_field") intField: Int)
> The field/column names can now be customized.






[jira] [Closed] (SPARK-19308) Unable to write to Hive table where column names contains period (.)

2017-01-22 Thread Maria Rebelka (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maria Rebelka closed SPARK-19308.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Unable to write to Hive table where column names contains period (.)
> 
>
> Key: SPARK-19308
> URL: https://issues.apache.org/jira/browse/SPARK-19308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Maria Rebelka
>  Labels: hive
> Fix For: 2.2.0
>
>
> When saving a DataFrame which contains columns with dots to Hive in append 
> mode, it only succeeds when the table doesn't exist yet.
> {noformat}
> scala> spark.sql("drop table test")
> res0: org.apache.spark.sql.DataFrame = []
> scala> val test = sc.parallelize(Array("{\"a\":1,\"b.b\":2}"))
> test: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at 
> parallelize at :24
> scala> val j = spark.read.json(test)
> j: org.apache.spark.sql.DataFrame = [a: bigint, b.b: bigint]
> scala> j.write.mode("append").saveAsTable("test")
> // succeeds
> scala> j.write.mode("append").saveAsTable("test")
> org.apache.spark.sql.AnalysisException: cannot resolve '`b.b`' given input 
> columns: [a, b.b]; line 1 pos 0;
> 'Project [a#6L, 'b.b]
> +- LogicalRDD [a#6L, b.b#7L]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:269)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:279)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:283)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:283)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$8.apply(QueryPlan.scala:288)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:288)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2603)
>   at org.apache.spark.sql.Dataset.select(Dataset.scala:969)
>   at org.apache.spark.sql.Dataset.selectExpr(Dataset.scala:1004)
>   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:236)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> 

[jira] [Commented] (SPARK-19308) Unable to write to Hive table where column names contains period (.)

2017-01-22 Thread Maria Rebelka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833657#comment-15833657
 ] 

Maria Rebelka commented on SPARK-19308:
---

Hi [~dongjoon],

Indeed, fixed in 2.2.0-SNAPSHOT.

Many thanks!

> Unable to write to Hive table where column names contains period (.)
> 
>
> Key: SPARK-19308
> URL: https://issues.apache.org/jira/browse/SPARK-19308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Maria Rebelka
>  Labels: hive
>
> When saving a DataFrame which contains columns with dots to Hive in append 
> mode, it only succeeds when the table doesn't exist yet.
> {noformat}
> scala> spark.sql("drop table test")
> res0: org.apache.spark.sql.DataFrame = []
> scala> val test = sc.parallelize(Array("{\"a\":1,\"b.b\":2}"))
> test: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at 
> parallelize at :24
> scala> val j = spark.read.json(test)
> j: org.apache.spark.sql.DataFrame = [a: bigint, b.b: bigint]
> scala> j.write.mode("append").saveAsTable("test")
> // succeeds
> scala> j.write.mode("append").saveAsTable("test")
> org.apache.spark.sql.AnalysisException: cannot resolve '`b.b`' given input 
> columns: [a, b.b]; line 1 pos 0;
> 'Project [a#6L, 'b.b]
> +- LogicalRDD [a#6L, b.b#7L]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:269)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:279)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:283)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:283)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$8.apply(QueryPlan.scala:288)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:288)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2603)
>   at org.apache.spark.sql.Dataset.select(Dataset.scala:969)
>   at org.apache.spark.sql.Dataset.selectExpr(Dataset.scala:1004)
>   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:236)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> 

[jira] [Updated] (SPARK-19218) Fix SET command to show a result correctly and in a sorted order

2017-01-22 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-19218:
--
Description: 
This issue aims to fix the following two things.

1. `sql("SET -v").collect()` or `sql("SET -v").show()` raises the following 
exception for a String configuration whose default value is `null`. For the test, 
please see the [Jenkins 
result](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71539/testReport/)
 and 
https://github.com/apache/spark/commit/60953bf1f1ba144e709fdae3903a390ff9479fd0 
in #16624.

{code}
sbt.ForkMain$ForkError: java.lang.RuntimeException: Error while decoding: 
java.lang.NullPointerException
createexternalrow(input[0, string, false].toString, input[1, string, 
false].toString, input[2, string, false].toString, 
StructField(key,StringType,false), StructField(value,StringType,false), 
StructField(meaning,StringType,false))
:- input[0, string, false].toString
:  +- input[0, string, false]
:- input[1, string, false].toString
:  +- input[1, string, false]
+- input[2, string, false].toString
   +- input[2, string, false]
{code}

2. Currently, the `SET` and `SET -v` commands show unsorted results.
We should show a sorted result for better UX. This also matches Hive's 
behavior.
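Until the fix lands, a user-side workaround sketch is simply to sort the output; the BEFORE/AFTER listings below show the intended built-in behavior:

{code}
scala> sql("SET").sort("key").show(false)
{code}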

*BEFORE*
{code}
scala> sql("set").show(false)
...
|spark.driver.host              |10.22.16.140                      |
|spark.driver.port              |63893                             |
|spark.repl.class.uri           |spark://10.22.16.140:63893/classes|
...
|spark.app.name                 |Spark shell                       |
|spark.driver.memory            |4G                                |
|spark.executor.id              |driver                            |
|spark.submit.deployMode        |client                            |
|spark.master                   |local[*]                          |
|spark.home                     |/Users/dhyun/spark                |
|spark.sql.catalogImplementation|hive                              |
|spark.app.id                   |local-1484333618945               |
{code}

*AFTER*

{code}
scala> sql("set").show(false)
...
|spark.app.id                   |local-1484333925649               |
|spark.app.name                 |Spark shell                       |
|spark.driver.host              |10.22.16.140                      |
|spark.driver.memory            |4G                                |
|spark.driver.port              |64994                             |
|spark.executor.id              |driver                            |
|spark.jars                     |                                  |
|spark.master                   |local[*]                          |
|spark.repl.class.uri           |spark://10.22.16.140:64994/classes|
...
{code}

[jira] [Updated] (SPARK-19218) Fix SET command to show a result correctly and in a sorted order

2017-01-22 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-19218:
--
Summary: Fix SET command to show a result correctly and in a sorted order  
(was: SET command should show a sorted result)

> Fix SET command to show a result correctly and in a sorted order
> 
>
> Key: SPARK-19218
> URL: https://issues.apache.org/jira/browse/SPARK-19218
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> Currently, the `SET` command shows unsorted results. We should show a sorted 
> result for better UX. This also matches Hive's behavior.
> *BEFORE*
> {code}
> scala> sql("set").show(false)
> +-------------------------------+---------------------------------------+
> |key                            |value                                  |
> +-------------------------------+---------------------------------------+
> |spark.driver.host              |10.22.16.140                           |
> |spark.driver.port              |63893                                  |
> |hive.metastore.warehouse.dir   |file:/Users/dhyun/spark/spark-warehouse|
> |spark.repl.class.uri           |spark://10.22.16.140:63893/classes     |
> |spark.jars                     |                                       |
> |spark.repl.class.outputDir     |/private/var/folders/bl/67vhzgqs1ks88l92h8dy8_1rgp/T/spark-43da424e-7530-4053-b30e-4068e8424dc9/repl-f1c957c7-2e4a-4f14-b234-f7b9f2447971|
> |spark.app.name                 |Spark shell                            |
> |spark.driver.memory            |4G                                     |
> |spark.executor.id              |driver                                 |
> |spark.submit.deployMode        |client                                 |
> |spark.master                   |local[*]                               |
> |spark.home                     |/Users/dhyun/spark                     |
> |spark.sql.catalogImplementation|hive                                   |
> |spark.app.id                   |local-1484333618945                    |
> +-------------------------------+---------------------------------------+
> {code}
> *AFTER*
> {code}
> scala> sql("set").show(false)
> +-------------------------------+--------------------------------------------------+
> |key                            |value                                             |
> +-------------------------------+--------------------------------------------------+
> |hive.metastore.warehouse.dir   |file:/Users/dhyun/SPARK-SORTED-SET/spark-warehouse|

[jira] [Commented] (SPARK-19288) Failure (at test_sparkSQL.R#1300): date functions on a DataFrame in R/run-tests.sh

2017-01-22 Thread Nirman Narang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833608#comment-15833608
 ] 

Nirman Narang commented on SPARK-19288:
---

[~felixcheung] I am using R version 3.2.3.
Tested on X86_64 and ppc64le with Ubuntu 16.04 and it fails on both.

> Failure (at test_sparkSQL.R#1300): date functions on a DataFrame in 
> R/run-tests.sh
> --
>
> Key: SPARK-19288
> URL: https://issues.apache.org/jira/browse/SPARK-19288
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL, Tests
>Affects Versions: 2.0.1
> Environment: Ubuntu 16.04, X86_64, ppc64le
>Reporter: Nirman Narang
>
> Full log here.
> {code:title=R/run-tests.sh|borderStyle=solid}
> Loading required package: methods
> Attaching package: 'SparkR'
> The following object is masked from 'package:testthat':
> describe
> The following objects are masked from 'package:stats':
> cov, filter, lag, na.omit, predict, sd, var, window
> The following objects are masked from 'package:base':
> as.data.frame, colnames, colnames<-, drop, intersect, rank, rbind,
> sample, subset, summary, transform, union
> functions on binary files : Spark package found in SPARK_HOME: 
> /var/lib/jenkins/workspace/Sparkv2.0.1/spark
> 
> binary functions : ...
> broadcast variables : ..
> functions in client.R : .
> test functions in sparkR.R : .Re-using existing Spark Context. Call 
> sparkR.session.stop() or restart R to create a new Spark Context
> ...
> include R packages : Spark package found in SPARK_HOME: 
> /var/lib/jenkins/workspace/Sparkv2.0.1/spark
> JVM API : ..
> MLlib functions : Spark package found in SPARK_HOME: 
> /var/lib/jenkins/workspace/Sparkv2.0.1/spark
> .SLF4J: Failed to load class 
> "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> .Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet dictionary page size to 1048576
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Dictionary is on
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Validation is off
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Writer version is: PARQUET_1_0
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem 
> columnStore to file. allocated memory: 65,622
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 70B for [label] 
> BINARY: 1 values, 21B raw, 23B comp, 1 pages, encodings: [PLAIN, BIT_PACKED, 
> RLE]
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 87B for [terms, 
> list, element, list, element] BINARY: 2 values, 42B raw, 43B comp, 1 pages, 
> encodings: [PLAIN, RLE]
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 30B for 
> [hasIntercept] BOOLEAN: 1 values, 1B raw, 3B comp, 1 pages, encodings: 
> [PLAIN, BIT_PACKED]
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet dictionary page size to 1048576
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Dictionary is on
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Validation is off
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Writer version is: PARQUET_1_0
> Jan 19, 2017 5:40:55 PM INFO: 
> org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem 
> columnStore to file. allocated memory: 49
> Jan 19, 2017 5:40:55 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 90B for [labels, 
> list, element] BINARY: 3 values, 50B raw, 50B comp, 1 pages, encodings: 
> [PLAIN, RLE]
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Jan 19, 2017 5:40:55 PM INFO: 

[jira] [Updated] (SPARK-19313) GaussianMixture throws cryptic error when number of features is too high

2017-01-22 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-19313:

Shepherd: Yanbo Liang

> GaussianMixture throws cryptic error when number of features is too high
> 
>
> Key: SPARK-19313
> URL: https://issues.apache.org/jira/browse/SPARK-19313
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Priority: Minor
>
> The following fails
> {code}
> val df = Seq(
>   Vectors.sparse(46400, Array(0, 4), Array(3.0, 8.0)),
>   Vectors.sparse(46400, Array(1, 5), Array(4.0, 9.0)))
>   .map(Tuple1.apply).toDF("features")
> val gm = new GaussianMixture()
> gm.fit(df)
> {code}
> It fails because GMMs allocate an array of size {{numFeatures * numFeatures}} 
> and in this case we'll get integer overflow. We should limit the number of 
> features appropriately.
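A quick check of the arithmetic behind that overflow, using the 46,400 features from the example above:

{code}
val numFeatures = 46400L
val entries = numFeatures * numFeatures   // 2,152,960,000 covariance entries
val maxArraySize = Int.MaxValue.toLong    // 2,147,483,647
assert(entries > maxArraySize)            // so an Int-sized allocation overflows
{code}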






[jira] [Commented] (SPARK-19332) table's location should check if a URI is legal

2017-01-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833532#comment-15833532
 ] 

Sean Owen commented on SPARK-19332:
---

I don't see how this is separate from SPARK-19257. These should be comments on 
that JIRA and implemented there.

> table's location should check if a URI is legal
> ---
>
> Key: SPARK-19332
> URL: https://issues.apache.org/jira/browse/SPARK-19332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Song Jun
>
> SPARK-19257's goal is to change the type of `CatalogStorageFormat`'s 
> locationUri to `URI`, but this has some problems:
> 1. `CatalogTable` and `CatalogTablePartition` use the same class 
> `CatalogStorageFormat`.
> 2. The type URI is fine for `CatalogTable`, but it is not appropriate for 
> `CatalogTablePartition`.
> 3. The location of a table partition can contain unencoded whitespace, and if 
> a partition location contains such whitespace, constructing a URI will throw 
> an exception. For example, `/path/2014-01-01 00%3A00%3A00` is a partition 
> location which contains whitespace.
> So if we change the type to URI, it is bad for `CatalogTablePartition`.
> I found that Hive has the same issue, HIVE-6185: before Hive 0.13 the location 
> was a URI, and after that PR it was changed to Path, with some checks done at 
> DDL time.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732
> So I think we can do the URI check for the table's location, and it is not 
> appropriate to change the type to URI.






[jira] [Assigned] (SPARK-19331) Improve the test coverage of SQLViewSuite

2017-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19331:


Assignee: (was: Apache Spark)

> Improve the test coverage of SQLViewSuite
> -
>
> Key: SPARK-19331
> URL: https://issues.apache.org/jira/browse/SPARK-19331
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Jiang Xingbo
>Priority: Minor
>
> Improve the test coverage of SQLViewSuite, cover the following cases:
> 1. view resolution;
> 2. nested view;
> 3. view with user specified column names.






[jira] [Commented] (SPARK-19331) Improve the test coverage of SQLViewSuite

2017-01-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833481#comment-15833481
 ] 

Apache Spark commented on SPARK-19331:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/16674

> Improve the test coverage of SQLViewSuite
> -
>
> Key: SPARK-19331
> URL: https://issues.apache.org/jira/browse/SPARK-19331
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Jiang Xingbo
>Priority: Minor
>
> Improve the test coverage of SQLViewSuite, cover the following cases:
> 1. view resolution;
> 2. nested view;
> 3. view with user specified column names.






[jira] [Assigned] (SPARK-19331) Improve the test coverage of SQLViewSuite

2017-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19331:


Assignee: Apache Spark

> Improve the test coverage of SQLViewSuite
> -
>
> Key: SPARK-19331
> URL: https://issues.apache.org/jira/browse/SPARK-19331
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>Priority: Minor
>
> Improve the test coverage of SQLViewSuite, cover the following cases:
> 1. view resolution;
> 2. nested view;
> 3. view with user specified column names.






[jira] [Created] (SPARK-19332) table's location should check if a URI is legal

2017-01-22 Thread Song Jun (JIRA)
Song Jun created SPARK-19332:


 Summary: table's location should check if a URI is legal
 Key: SPARK-19332
 URL: https://issues.apache.org/jira/browse/SPARK-19332
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Song Jun


SPARK-19257's goal is to change the type of `CatalogStorageFormat`'s 
locationUri to `URI`, but this has some problems:

1. `CatalogTable` and `CatalogTablePartition` use the same class 
`CatalogStorageFormat`.
2. The type URI is fine for `CatalogTable`, but it is not appropriate for 
`CatalogTablePartition`.
3. The location of a table partition can contain unencoded whitespace, and if 
a partition location contains such whitespace, constructing a URI will throw 
an exception. For example, `/path/2014-01-01 00%3A00%3A00` is a partition 
location which contains whitespace.

So if we change the type to URI, it is bad for `CatalogTablePartition`.

I found that Hive has the same issue, HIVE-6185: before Hive 0.13 the location 
was a URI, and after that PR it was changed to Path, with some checks done at 
DDL time.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

So I think we can do the URI check for the table's location, and it is not 
appropriate to change the type to URI.
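A small sketch of the failure mode (the path is illustrative): a partition location with an unencoded space is accepted by Hadoop's Path but rejected by java.net.URI.

{code}
import java.net.URI
import org.apache.hadoop.fs.Path
import scala.util.Try

val location = "/path/2014-01-01 00%3A00%3A00"

val asPath = new Path(location)     // fine: Path encodes the space internally
val asUri  = Try(new URI(location)) // Failure(java.net.URISyntaxException: Illegal character in path ...)
{code}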







[jira] [Updated] (SPARK-19332) table's location should check if a URI is legal

2017-01-22 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19332:
-
Description: 
SPARK-19257's goal is to change the type of `CatalogStorageFormat`'s 
locationUri to `URI`, but this has some problems:

1. `CatalogTable` and `CatalogTablePartition` use the same class 
`CatalogStorageFormat`.
2. The type URI is fine for `CatalogTable`, but it is not appropriate for 
`CatalogTablePartition`.
3. The location of a table partition can contain unencoded whitespace, and if 
a partition location contains such whitespace, constructing a URI will throw 
an exception. For example, `/path/2014-01-01 00%3A00%3A00` is a partition 
location which contains whitespace.

So if we change the type to URI, it is bad for `CatalogTablePartition`.

I found that Hive has the same issue, HIVE-6185: before Hive 0.13 the location 
was a URI, and after that PR it was changed to Path, with some checks done at 
DDL time.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

So I think we can do the URI check for the table's location, and it is not 
appropriate to change the type to URI.


  was:
~SPARK-19257 ‘s work is to change the type of  `CatalogStorageFormat` 's 
locationUri to `URI`, while it has some problem:

1.`CatalogTable` and `CatalogTablePartition` use the same class 
`CatalogStorageFormat`
2. the type URI is ok for `CatalogTable`, but it is not proper for 
`CatalogTablePartition`
3. the location of a table partition can contains a not encode whitespace, so 
  if a partition location contains this not encode whitespace, and it will 
throw an exception for URI. for example `/path/2014-01-01 00%3A00%3A00` is a 
partition location which has whitespace

so if we change the type to URI, it is bad for `CatalogTablePartition`

and I found Hive has the same issue ~HIVE-6185
before hive 0.13 the location is URI, while after above PR, it change it to 
Path, and do some check when DDL.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

so I think ,we can do the URI check for the table's location , and it is not 
proper to change the type to URI.



> table's location should check if a URI is legal
> ---
>
> Key: SPARK-19332
> URL: https://issues.apache.org/jira/browse/SPARK-19332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Song Jun
>
> SPARK-19257's goal is to change the type of `CatalogStorageFormat`'s 
> locationUri to `URI`, but this has some problems:
> 1. `CatalogTable` and `CatalogTablePartition` use the same class 
> `CatalogStorageFormat`.
> 2. The type URI is fine for `CatalogTable`, but it is not appropriate for 
> `CatalogTablePartition`.
> 3. The location of a table partition can contain unencoded whitespace, and if 
> a partition location contains such whitespace, constructing a URI will throw 
> an exception. For example, `/path/2014-01-01 00%3A00%3A00` is a partition 
> location which contains whitespace.
> So if we change the type to URI, it is bad for `CatalogTablePartition`.
> I found that Hive has the same issue, HIVE-6185: before Hive 0.13 the location 
> was a URI, and after that PR it was changed to Path, with some checks done at 
> DDL time.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732
> So I think we can do the URI check for the table's location, and it is not 
> appropriate to change the type to URI.






[jira] [Created] (SPARK-19331) Improve the test coverage of SQLViewSuite

2017-01-22 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-19331:


 Summary: Improve the test coverage of SQLViewSuite
 Key: SPARK-19331
 URL: https://issues.apache.org/jira/browse/SPARK-19331
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Jiang Xingbo
Priority: Minor


Improve the test coverage of SQLViewSuite, cover the following cases:
1. view resolution;
2. nested view;
3. view with user specified column names.
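For illustration, the kinds of statements such cases might exercise (a sketch only, not the actual test suite; a table {{t}} is assumed to exist):

{code}
spark.sql("CREATE VIEW v1 AS SELECT id FROM t")            // 1. view resolution
spark.sql("CREATE VIEW v2 AS SELECT * FROM v1")            // 2. nested view
spark.sql("CREATE VIEW v3 (renamed) AS SELECT id FROM t")  // 3. user-specified column names
{code}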






[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String

2017-01-22 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832935#comment-15832935
 ] 

Song Jun edited comment on SPARK-19257 at 1/22/17 11:22 AM:


I found that `CatalogTablePartition` and `CatalogTable` use the same class 
`CatalogTableStorageFormat`.
If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, 
this will fail for `CatalogTablePartition`, because a table partition location 
can contain whitespace, such as `2014-01-01 00%3A00%3A00`, which is not a 
legal URI (the whitespace is not encoded).

a) Should we separate CatalogTableStorageFormat for CatalogTablePartition 
and CatalogTable?

b) Or should we just check whether the `locationUri: String` is illegal when running a DDL 
command, such as `alter table set location xx` or `create table t options 
(path xx)`?

Hive's implementation is like b):
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

In Hive 0.12 the table's location method took a URI parameter; after that 
version it changed to Path:
https://github.com/apache/hive/blob/branch-0.12/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java#L501

This is the same issue in Hive: 
https://issues.apache.org/jira/browse/HIVE-6185

[~cloud_fan] [~smilegator]


was (Author: windpiger):
I found it that `CatalogTablePartition` and `CatalogTable` use the same class 
`CatalogTableStorageFormat`.
if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, 
this will failed in `CatalogTablePartition` because table partition location 
can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a 
legal URI (no encoded whitespace) 

a)should we seperate the CatalogTableStorageFormat from CatalogTablePartition 
and CatalogTable?

b)or we just check if the ` locationUri: String` is illegal when make a DDL 
command, such as ` alter table set location xx` or `create table t options 
(path xx) `

Hive's implement is like b) :
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

[~cloud_fan] [~smilegator]

> The type of CatalogStorageFormat.locationUri should be java.net.URI instead 
> of String
> -
>
> Key: SPARK-19257
> URL: https://issues.apache.org/jira/browse/SPARK-19257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Currently we treat `CatalogStorageFormat.locationUri` as a URI string and 
> always convert it to a path by `new Path(new URI(locationUri))`.
> It will be safer if we can make the type of 
> `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19323) Upgrade breeze to 0.13

2017-01-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833455#comment-15833455
 ] 

Sean Owen commented on SPARK-19323:
---

OK, while waiting, you'd need to investigate if there are breaking changes. Is 
there anything about 0.13 that would cause apps to be incompatible?

> Upgrade breeze to 0.13
> --
>
> Key: SPARK-19323
> URL: https://issues.apache.org/jira/browse/SPARK-19323
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: koert kuipers
>Priority: Minor
>
> SPARK-16494 upgraded breeze to 0.12. this unfortunately brings in a new 
> dependency on an old versions of shapelesss (v2.0.0). breeze 0.13 depends on 
> newer shapeless. breeze 0.13 is currently rc1 so will have to wait a bit.
> see discussion here:
> http://apache-spark-developers-list.1001551.n3.nabble.com/shapeless-in-spark-2-1-0-td20392.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19264) Work should start driver, the same to AM of yarn

2017-01-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833451#comment-15833451
 ] 

Sean Owen commented on SPARK-19264:
---

I don't think the driver should terminate immediately if the main thread quits, 
because that is not how any JVM app works. The main reason is it abruptly 
cancels any cleanup. It's fair to assume the app is terminating, but not 
terminated. The AM behavior makes sense. But terminating does not mean done.
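
(As a side illustration of the JVM behavior under discussion, a toy example rather than Spark code: a process keeps running after main returns as long as a non-daemon thread is alive.)
{code}
object NonDaemonDemo {
  def main(args: Array[String]): Unit = {
    val worker = new Thread(new Runnable {
      override def run(): Unit = {
        Thread.sleep(60000)              // still running long after main has returned
        println("worker finished")
      }
    })
    worker.setDaemon(false)              // non-daemon: the JVM stays alive until it ends
    worker.start()
    println("main returns now")          // main exits here, but the process does not
  }
}
{code}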

> Work should start driver, the same to  AM  of yarn 
> ---
>
> Key: SPARK-19264
> URL: https://issues.apache.org/jira/browse/SPARK-19264
> Project: Spark
>  Issue Type: Improvement
>Reporter: hustfxj
>
> I think the worker can't start the driver via "ProcessBuilderLike", because then we 
> can't tell whether the application's main thread has finished when the application 
> also runs non-daemon threads: the program only terminates when no non-daemon thread 
> is left running (or someone calls System.exit), and the main thread may have finished 
> long ago.
> The worker should start the driver the way the AM of YARN does, as follows:
> {code:title=ApplicationMaster.scala|borderStyle=solid}
>  mainMethod.invoke(null, userArgs.toArray)
>  finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
>  logDebug("Done running users class")
> {code}
> Then the worker can monitor the driver's main thread and know the 
> application's state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19214) Inconsistencies between DataFrame and Dataset APIs

2017-01-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833447#comment-15833447
 ] 

Sean Owen commented on SPARK-19214:
---

I'm pretty sure we would not change the naming here, even if it is a bit 
inconsistent, because that would be a breaking change. I'm also not yet convinced 
it isn't intentional.
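
(A workaround sketch rather than a proposal to change the API: the typed result can simply be renamed to line up with the untyped one, reusing the {{ps}} Dataset from the example below.)
{code}
ps.groupByKey(_.x).count().toDF("x", "count").printSchema
// root
//  |-- x: integer (nullable = true)
//  |-- count: long (nullable = false)
{code}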

> Inconsistencies between DataFrame and Dataset APIs
> --
>
> Key: SPARK-19214
> URL: https://issues.apache.org/jira/browse/SPARK-19214
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Alexander Alexandrov
>Priority: Trivial
>
> I am not sure whether this has been reported already, but there are some 
> confusing & annoying inconsistencies when programming the same expression in 
> the Dataset and the DataFrame APIs.
> Consider the following minimal example executed in a Spark Shell:
> {code}
> case class Point(x: Int, y: Int, z: Int)
> val ps = spark.createDataset(for {
>   x <- 1 to 10; 
>   y <- 1 to 10; 
>   z <- 1 to 10
> } yield Point(x, y, z))
> // Problem 1:
> // count produces different fields in the Dataset / DataFrame variants
> // count() on grouped DataFrame: field name is `count`
> ps.groupBy($"x").count().printSchema
> // root
> //  |-- x: integer (nullable = false)
> //  |-- count: long (nullable = false)
> // count() on grouped Dataset: field name is `count(1)`
> ps.groupByKey(_.x).count().printSchema
> // root
> //  |-- value: integer (nullable = true)
> //  |-- count(1): long (nullable = false)
> // Problem 2:
> // groupByKey produces different `key` field name depending
> // on the result type
> // this is especially confusing in the first case below (simple key types)
> // where the key field is actually named `value`
> // simple key types
> ps.groupByKey(p => p.x).count().printSchema
> // root
> //  |-- value: integer (nullable = true)
> //  |-- count(1): long (nullable = false)
> // complex key types
> ps.groupByKey(p => (p.x, p.y)).count().printSchema
> // root
> //  |-- key: struct (nullable = false)
> //  ||-- _1: integer (nullable = true)
> //  ||-- _2: integer (nullable = true)
> //  |-- count(1): long (nullable = false)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19015) SQL request with transformation cannot be eecuted if not run first a scan table

2017-01-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19015.
---
Resolution: Invalid

> SQL request with transformation cannot be eecuted if not run first a scan 
> table
> ---
>
> Key: SPARK-19015
> URL: https://issues.apache.org/jira/browse/SPARK-19015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: lakhdar adil
>
> Hello,
> I have a Spark Streaming job which reads from Kafka and sends results to Elasticsearch.
> I have a union query between two tables, the "statswithrowid" table and the 
> "queryes" table:
> sqlContext.sql(s"select id, rowid,agentId,datecalcul,'KAFKA' as source from 
> statswithrowid  where id IN ($ids) and agentId = '$agent' UNION select id, 
> rowid,agentId,datecalcul, 'ES' as source from queryes where agentId = 
> '$agent'")
> This query cannot be executed on its own. Today I need to execute the two queries 
> below first so that the union query works fine. Please find below the two queries 
> which must be run before the union query:
> query on the "statswithrowid" table:
>   sqlContext.sql(s"select id, rowid,agentId,datecalcul,'KAFKA' as 
> source from statswithrowid  where id IN ($ids) and agentId = '$agent'").show()
> request on "queryes" table : 
>   sqlContext.sql(s"select id, rowid,agentId,datecalcul, 'ES' as 
> source from queryes where agentId = '$agent'").show()
> For information: if I don't call .show() on the two queries before running the 
> union, nothing works.
> Why do I need to run those first before the union query? What is the best way to 
> work with union queries? I tried a union on DataFrames and hit the same problem.
> I look forward to your reply. Thank you in advance.
> Best regards,
> Adil LAKHDAR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19328) using spark thrift server gets memory leak problem in `ExecutorsListener`

2017-01-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833440#comment-15833440
 ] 

Sean Owen commented on SPARK-19328:
---

Questions should probably start on the mailing list.
What is taking up memory, do you think? 
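
(For reference, a sketch of the existing UI-retention settings that bound what the UI listeners keep; whether they actually cover the data held by {{ExecutorsListener}} in 2.0.2 is the question here, so this is illustrative only and the values are arbitrary.)
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ui.retainedJobs", "200")
  .set("spark.ui.retainedStages", "200")
  .set("spark.ui.retainedTasks", "10000")
  .set("spark.ui.retainedDeadExecutors", "50")
{code}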

> using spark thrift server gets memory leak problem in `ExecutorsListener`
> -
>
> Key: SPARK-19328
> URL: https://issues.apache.org/jira/browse/SPARK-19328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: spark2.0.2
>Reporter: roncenzhao
>
> When I use spark-thrift-server, the memory usage will gradually increase. 
> I found that in `ExecutorsListener` all the data about executors is never 
> trimmed.
> So do we need some operations to trim the data ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19330) Also show tooltip for successful batches

2017-01-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19330:
--
Priority: Trivial  (was: Major)

> Also show tooltip for successful batches
> 
>
> Key: SPARK-19330
> URL: https://issues.apache.org/jira/browse/SPARK-19330
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.0.3, 2.1.0, 2.1.1
>Reporter: Liwei Lin
>Priority: Trivial
>
> [Before]
> !https://cloud.githubusercontent.com/assets/15843379/22181462/1e45c20c-e0c8-11e6-831c-8bf69722a4ee.png!
> [After]
> !https://cloud.githubusercontent.com/assets/15843379/22181464/23f38a40-e0c8-11e6-9a87-e27b1ffb1935.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18563) mapWithState: initialState should have a timeout setting per record

2017-01-22 Thread Genmao Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833416#comment-15833416
 ] 

Genmao Yu edited comment on SPARK-18563 at 1/22/17 9:35 AM:


I do not know whether there is any plan to add new features to DStreams. Maybe we 
should focus on Structured Streaming?

[~zsxwing]


was (Author: unclegen):
I do not know whether there is any plan to add new features to DStreams. Maybe we 
should focus on Structured Streaming?

> mapWithState: initialState should have a timeout setting per record
> ---
>
> Key: SPARK-18563
> URL: https://issues.apache.org/jira/browse/SPARK-18563
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Reporter: Daniel Haviv
>
> When passing an initialState to mapWithState, there should be a way to 
> set a timeout at the record level.
> If, for example, mapWithState is configured with a 48H timeout, loading an 
> initialState will cause the state to bloat and hold 96H of data, and then 
> release 48H of data at once.
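
(For context, a minimal sketch of the current API, assuming a local StreamingContext and toy data; the timeout on StateSpec is a single global idle duration, which is exactly what this ticket would like to refine per record.)
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, State, StateSpec, StreamingContext}

val conf = new SparkConf().setAppName("mapWithState-timeout-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
val initialRDD = ssc.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))   // toy initial state

def mappingFunc(key: String, value: Option[Int], state: State[Int]): Option[(String, Int)] = {
  if (state.isTimingOut()) {
    None                                        // key is being dropped by the global timeout
  } else {
    val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
    state.update(sum)
    Some((key, sum))
  }
}

val spec = StateSpec
  .function(mappingFunc _)
  .initialState(initialRDD)
  .timeout(Minutes(48 * 60))                    // one idle timeout for every key, initial or not
// someKeyedStream.mapWithState(spec)
{code}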



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18563) mapWithState: initialState should have a timeout setting per record

2017-01-22 Thread Genmao Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833416#comment-15833416
 ] 

Genmao Yu commented on SPARK-18563:
---

I do not know whether there is any plan to add new features to DStreams. Maybe we 
should focus on Structured Streaming?

> mapWithState: initialState should have a timeout setting per record
> ---
>
> Key: SPARK-18563
> URL: https://issues.apache.org/jira/browse/SPARK-18563
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Reporter: Daniel Haviv
>
> When passing an initialState to mapWithState, there should be a way to 
> set a timeout at the record level.
> If, for example, mapWithState is configured with a 48H timeout, loading an 
> initialState will cause the state to bloat and hold 96H of data, and then 
> release 48H of data at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19330) Also show tooltip for successful batches

2017-01-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833415#comment-15833415
 ] 

Apache Spark commented on SPARK-19330:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16673

> Also show tooltip for successful batches
> 
>
> Key: SPARK-19330
> URL: https://issues.apache.org/jira/browse/SPARK-19330
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.0.3, 2.1.0, 2.1.1
>Reporter: Liwei Lin
>
> [Before]
> !https://cloud.githubusercontent.com/assets/15843379/22181462/1e45c20c-e0c8-11e6-831c-8bf69722a4ee.png!
> [After]
> !https://cloud.githubusercontent.com/assets/15843379/22181464/23f38a40-e0c8-11e6-9a87-e27b1ffb1935.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19330) Also show tooltip for successful batches

2017-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19330:


Assignee: Apache Spark

> Also show tooltip for successful batches
> 
>
> Key: SPARK-19330
> URL: https://issues.apache.org/jira/browse/SPARK-19330
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.0.3, 2.1.0, 2.1.1
>Reporter: Liwei Lin
>Assignee: Apache Spark
>
> [Before]
> !https://cloud.githubusercontent.com/assets/15843379/22181462/1e45c20c-e0c8-11e6-831c-8bf69722a4ee.png!
> [After]
> !https://cloud.githubusercontent.com/assets/15843379/22181464/23f38a40-e0c8-11e6-9a87-e27b1ffb1935.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19330) Also show tooltip for successful batches

2017-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19330:


Assignee: (was: Apache Spark)

> Also show tooltip for successful batches
> 
>
> Key: SPARK-19330
> URL: https://issues.apache.org/jira/browse/SPARK-19330
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.0.3, 2.1.0, 2.1.1
>Reporter: Liwei Lin
>
> [Before]
> !https://cloud.githubusercontent.com/assets/15843379/22181462/1e45c20c-e0c8-11e6-831c-8bf69722a4ee.png!
> [After]
> !https://cloud.githubusercontent.com/assets/15843379/22181464/23f38a40-e0c8-11e6-9a87-e27b1ffb1935.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19330) Also show tooltip for successful batches

2017-01-22 Thread Liwei Lin (JIRA)
Liwei Lin created SPARK-19330:
-

 Summary: Also show tooltip for successful batches
 Key: SPARK-19330
 URL: https://issues.apache.org/jira/browse/SPARK-19330
 Project: Spark
  Issue Type: Improvement
  Components: DStreams
Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0, 2.0.3, 2.1.1
Reporter: Liwei Lin


[Before]
!https://cloud.githubusercontent.com/assets/15843379/22181462/1e45c20c-e0c8-11e6-831c-8bf69722a4ee.png!

[After]
!https://cloud.githubusercontent.com/assets/15843379/22181464/23f38a40-e0c8-11e6-9a87-e27b1ffb1935.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19329) after alter a datasource table's location to a not exist location and then insert data throw Exception

2017-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19329:


Assignee: (was: Apache Spark)

> after alter a datasource table's location to a not exist location and then 
> insert data throw Exception
> --
>
> Key: SPARK-19329
> URL: https://issues.apache.org/jira/browse/SPARK-19329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Song Jun
>
> spark.sql("create table t(a string, b int) using parquet")
> spark.sql(s"alter table t set location '$notexistedlocation'")
> spark.sql("insert into table t select 'c', 1")
> this will throw an exception:
> com.google.common.util.concurrent.UncheckedExecutionException: 
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> $notexistedlocation;
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19329) after alter a datasource table's location to a not exist location and then insert data throw Exception

2017-01-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833413#comment-15833413
 ] 

Apache Spark commented on SPARK-19329:
--

User 'windpiger' has created a pull request for this issue:
https://github.com/apache/spark/pull/16672

> after alter a datasource table's location to a not exist location and then 
> insert data throw Exception
> --
>
> Key: SPARK-19329
> URL: https://issues.apache.org/jira/browse/SPARK-19329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Song Jun
>
> spark.sql("create table t(a string, b int) using parquet")
> spark.sql(s"alter table t set location '$notexistedlocation'")
> spark.sql("insert into table t select 'c', 1")
> this will throw an exception:
> com.google.common.util.concurrent.UncheckedExecutionException: 
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> $notexistedlocation;
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19329) after alter a datasource table's location to a not exist location and then insert data throw Exception

2017-01-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19329:


Assignee: Apache Spark

> after alter a datasource table's location to a not exist location and then 
> insert data throw Exception
> --
>
> Key: SPARK-19329
> URL: https://issues.apache.org/jira/browse/SPARK-19329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Song Jun
>Assignee: Apache Spark
>
> spark.sql("create table t(a string, b int) using parquet")
> spark.sql(s"alter table t set location '$notexistedlocation'")
> spark.sql("insert into table t select 'c', 1")
> this will throw an exception:
> com.google.common.util.concurrent.UncheckedExecutionException: 
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> $notexistedlocation;
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18839) Executor is active on web, but actually is dead

2017-01-22 Thread Genmao Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833409#comment-15833409
 ] 

Genmao Yu commented on SPARK-18839:
---

Sorry, I do not think this is a bug.

> Executor is active on web, but actually is dead
> ---
>
> Key: SPARK-18839
> URL: https://issues.apache.org/jira/browse/SPARK-18839
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: meiyoula
>Priority: Minor
>
> When a container is preempted, the AM finds it is completed and the driver removes 
> its block manager. But the executor actually dies a few seconds later; during this 
> period it updates blocks and re-registers the block manager, so the executors 
> page shows the executor as active.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19329) after alter a datasource table's location to a not exist location and then insert data throw Exception

2017-01-22 Thread Song Jun (JIRA)
Song Jun created SPARK-19329:


 Summary: after alter a datasource table's location to a not exist 
location and then insert data throw Exception
 Key: SPARK-19329
 URL: https://issues.apache.org/jira/browse/SPARK-19329
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Song Jun


spark.sql("create table t(a string, b int) using parquet")
spark.sql(s"alter table t set location '$notexistedlocation'")
spark.sql("insert into table t select 'c', 1")

this will throw an exception:

com.google.common.util.concurrent.UncheckedExecutionException: 
org.apache.spark.sql.AnalysisException: Path does not exist: 
$notexistedlocation;
at 
com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814)
at 
com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122)
at 
org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at 
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18750) spark should be able to control the number of executor and should not throw stack overslow

2017-01-22 Thread Jelmer Kuperus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833386#comment-15833386
 ] 

Jelmer Kuperus commented on SPARK-18750:


I am seeing the exact same issue when using dynamic allocation and running just a 
basic Spark SQL query over a large data set.
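
(A possible mitigation sketch, assuming dynamic allocation on YARN; the values are arbitrary examples, and this only bounds the requests rather than fixing the underlying overflow.)
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")        // required for dynamic allocation on YARN
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "200")  // hard upper bound on executor requests
{code}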

> spark should be able to control the number of executor and should not throw 
> stack overslow
> --
>
> Key: SPARK-18750
> URL: https://issues.apache.org/jira/browse/SPARK-18750
> Project: Spark
>  Issue Type: Bug
>Reporter: Neerja Khattar
>
> When running SQL queries on large datasets, the job fails with a stack overflow 
> warning and shows that it is requesting lots of executors.
> It looks like there is no limit on the number of executors, not even an upper 
> bound based on the YARN resources available.
> {noformat}
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n5.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n8.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n2.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 
> 32770 executor(s). 
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor 
> containers, each with 1 cores and 6758 MB memory including 614 MB overhead 
> 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 
> 52902 executor(s). 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n5.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n8.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n2.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 
> 32770 executor(s).
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor 
> containers, each with 1 cores and 6758 MB memory including 614 MB overhead
> 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 
> 52902 executor(s).
> 16/11/29 15:49:11 WARN yarn.ApplicationMaster: Reporter thread fails 1 
> time(s) in a row.
> java.lang.StackOverflowError
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:57)
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:36)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>   at 
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>   at 
> scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.TraversableLike$class.$plus$plus(TraversableLike.scala:156)
>   at 
> scala.collection.AbstractTraversable.$plus$plus(Traversable.scala:105)
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:60)
>   at scala.collection.immutable.Map$Map4.updated(Map.scala:172)
>   at scala.collection.immutable.Map$Map4.$plus(Map.scala:173)
>   at scala.collection.immutable.Map$Map4.$plus(Map.scala:158)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> 

[jira] [Commented] (SPARK-18805) InternalMapWithStateDStream make java.lang.StackOverflowError

2017-01-22 Thread Genmao Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833377#comment-15833377
 ] 

Genmao Yu commented on SPARK-18805:
---

+1 to {{That should be not an infinite loop. The time is different on each 
call.}}

That said, there really is a potential {{StackOverflowError}} issue, but in my 
understanding it is hard to trigger. Could you please provide your settings 
for checkpointDuration and batchDuration?
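
(For reference, a minimal sketch of where those two settings come from, with placeholder values; the actual values in the failing job are what is being asked for.)
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("checkpoint-settings-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))   // batchDuration
ssc.checkpoint("/tmp/checkpoint-dir")              // checkpoint directory for stateful streams

// The checkpoint interval can also be set explicitly on a stateful DStream:
// stateStream.checkpoint(Seconds(50))
{code}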

> InternalMapWithStateDStream make java.lang.StackOverflowError 
> --
>
> Key: SPARK-18805
> URL: https://issues.apache.org/jira/browse/SPARK-18805
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.3, 2.0.2
> Environment: mesos
>Reporter: etienne
>
> When loading InternalMapWithStateDStream from a checkpoint,
> if isValidTime is true and there is no generatedRDD at the given time, 
> there is an infinite loop:
> 1) compute is called on InternalMapWithStateDStream
> 2) InternalMapWithStateDStream tries to generate the previousRDD
> 3) the stream looks in generatedRDD to see whether the RDD is already generated for 
> the given time
> 4) it does not find the RDD, so it checks whether the time is valid
> 5) if the time is valid, compute is called on InternalMapWithStateDStream again
> 6) restart from 1)
> Here is the exception that illustrates this error:
> {code}
> Exception in thread "streaming-start" java.lang.StackOverflowError
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19260) Spaces or "%20" in path parameter are not correctly handled with HistoryServer

2017-01-22 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-19260:

Summary:  Spaces or "%20" in path parameter are not correctly handled with 
HistoryServer  (was:  Spaces in path parameter are not correctly handled with 
HistoryServer)

>  Spaces or "%20" in path parameter are not correctly handled with 
> HistoryServer
> ---
>
> Key: SPARK-19260
> URL: https://issues.apache.org/jira/browse/SPARK-19260
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: linux
>Reporter: zuotingbing
>
> case1: .sbin/start-history-server.sh "file:/a b"
> And you get the error ===>
> Caused by: java.lang.IllegalArgumentException: Log directory specified does 
> not exist: file:/root/file:/a%20b.
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$startPolling(FsHistoryProvider.scala:201)
> case2: ./start-history-server.sh "/a%20c"
> And you get the error ===>
> Caused by: java.lang.IllegalArgumentException: Log directory specified does 
> not exist: file:/a%2520c/.
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$startPolling(FsHistoryProvider.scala:201)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19260) Spaces in path parameter are not correctly handled with HistoryServer

2017-01-22 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-19260:

Description: 
case1: .sbin/start-history-server.sh "file:/a b"
And you get the error ===>
Caused by: java.lang.IllegalArgumentException: Log directory specified does not 
exist: file:/root/file:/a%20b.
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$startPolling(FsHistoryProvider.scala:201)

case2: ./start-history-server.sh "/a%20c"
And you get the error ===>
Caused by: java.lang.IllegalArgumentException: Log directory specified does not 
exist: file:/a%2520c/.
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$startPolling(FsHistoryProvider.scala:201)

  was:
run .sbin/start-history-server.sh "file:/a b"

And you get the error:
Caused by: java.lang.IllegalArgumentException: Log directory specified does not 
exist: file:/root/file:/a%20b.
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$startPolling(FsHistoryProvider.scala:201)


>  Spaces in path parameter are not correctly handled with HistoryServer
> --
>
> Key: SPARK-19260
> URL: https://issues.apache.org/jira/browse/SPARK-19260
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: linux
>Reporter: zuotingbing
>
> case1: .sbin/start-history-server.sh "file:/a b"
> And you get the error ===>
> Caused by: java.lang.IllegalArgumentException: Log directory specified does 
> not exist: file:/root/file:/a%20b.
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$startPolling(FsHistoryProvider.scala:201)
> case2: ./start-history-server.sh "/a%20c"
> And you get the error ===>
> Caused by: java.lang.IllegalArgumentException: Log directory specified does 
> not exist: file:/a%2520c/.
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$startPolling(FsHistoryProvider.scala:201)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting

2017-01-22 Thread Ran Haim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833367#comment-15833367
 ] 

Ran Haim commented on SPARK-17436:
--

Hi,
I have not actually used 2.1 yet, so I cannot be 100% sure.
I can see that in org.apache.spark.sql.execution.datasources.FileFormatWriter 
a sorter is created whose key includes the inner sort columns:
val sortingExpressions: Seq[Expression] =
  description.partitionColumns ++ bucketIdExpression ++ sortColumns

The best way to test it is to query the individual files with orcdump/parquet-tools 
and make sure they are sorted.
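
(A small way to do such a check, assuming a hypothetical DataFrame {{df}} with a partition column "date" and an inner sort key "ts":)
{code}
import org.apache.spark.sql.functions.col

df.repartition(col("date"))
  .sortWithinPartitions(col("date"), col("ts"))
  .write
  .partitionBy("date")
  .parquet("/tmp/sorted-check")
// then inspect individual part files, e.g. `parquet-tools cat <part-file>`,
// and confirm that the "ts" values are still ordered within each file.
{code}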

> dataframe.write sometimes does not keep sorting
> ---
>
> Key: SPARK-17436
> URL: https://issues.apache.org/jira/browse/SPARK-17436
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Ran Haim
>Priority: Minor
>
> update
> ***
> It seems that in the Spark 2.1 code, the sorting issue is resolved.
> The sorter does include the inner sort in the sorting key - but I think it 
> would be faster to just insert the rows into lists in a hash map.
> ***
> When using partitionBy, the data writer can sometimes mess up an ordered 
> DataFrame.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it 
> starts inserting rows into an UnsafeKVExternalSorter, then reads all the rows 
> back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes destroy the original sort (or secondary sort, if 
> you will).
> I think the best way to fix it is to stop using a sorter and just put the 
> rows in a map keyed by the partition key with an ArrayList as the value, then 
> walk through all the keys and write the rows in their original order - this will 
> probably be faster as there is no need for sorting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer perfermence optimization

2017-01-22 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833345#comment-15833345
 ] 

zhengruifeng edited comment on SPARK-19208 at 1/22/17 8:25 AM:
---

After diving into Spark SQL's UDAFs, I designed the new API like this:

a new MultivariateOnlineSummarizer in org.apache.spark.ml.stat
{code}
class MultivariateOnlineSummarizer(private var metrics: Seq[String]) extends 
UserDefinedAggregateFunction {
def setMetrics(metrics: Seq[String]) = ...
def setMetrics(metric: String, others: String*) = ...
override def inputSchema: StructType = new StructType().add("weight", 
DoubleType).add("features", new VectorUDT)
override def bufferSchema: StructType = ...
override def dataType: DataType = DataTypes.createMapType(StringType, new 
VectorUDT)
override def deterministic: Boolean = true
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = 
...
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = 
...
override def evaluate(buffer: Row): Map[String,Vector] = ...
}
{code}

usage examples:
{code}
// in MinMaxScaler
val maxAbsAgg = new MultivariateOnlineSummarizer().setMetrics("min", "max")
val summary = dataset.groupBy().agg(maxAbsAgg(col("features"), col("weight")))
summary.show
+-+
|multivariateonlinesummarizer(features, weight)|
+-+
| Map(min -> [1.0,0...|
+-+
summary.first
res2: org.apache.spark.sql.Row = [Map(min -> [1.0,0.2,0.2], max -> 
[1.0,0.2,0.2])]
val result = summary.first.getAs[Map[String,Vector]](0)
val min: Vector = result("min")
val max: Vector = result("max")

// in LinearRegression
val featuresAgg = new MultivariateOnlineSummarizer().setMetrics("mean", 
"variance")
val labelAgg = new MultivariateOnlineSummarizer().setMetrics("mean", "variance")

val result = dataset.map{...}.groupBy().agg(featuresAgg(col("features"), 
col("weight")), labelAgg(col("labelVec"), col("weight"))).first
val featuresMetrics = result.getAs[Map[String,Vector]](0)
val labelMetrics = result.getAs[Map[String,Vector]](1)

// in NaiveBayes (if we add "weightedCount" and "sum" in metrics)
val sumAgg = new MultivariateOnlineSummarizer().setMetrics("weightedCount", 
"sum")
val aggregated = dataset.select(col($(labelCol)), w, col($(featuresCol)))
.map{...requireValues(features)...}
.groupBy($(labelCol)).agg(sumAgg(col("features"), col("weight")))
...
{code}

I have not found a way to output multiple columns from a UDAF, so I use 
{{Map[String,Vector]}} as the output type for now. If there is a way, 
I'll be happy to change this.
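
(One possible direction, an assumption rather than something verified against this design: declare the UDAF's dataType as a StructType and return a Row from evaluate, then expand the struct on the caller side, reusing the {{dataset}} and {{maxAbsAgg}} names from the example above.)
{code}
import org.apache.spark.sql.functions.col

// inside the UDAF (sketch):
//   override def dataType: DataType =
//     new StructType().add("min", new VectorUDT).add("max", new VectorUDT)
//   override def evaluate(buffer: Row): Any = Row(minVector, maxVector)

// on the caller side the struct fields then read like separate columns:
val agg = dataset.groupBy().agg(maxAbsAgg(col("features"), col("weight")).as("summary"))
agg.select(col("summary.min"), col("summary.max")).show()
{code}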


was (Author: podongfeng):
After diving into sparksql's udaf, I design the new api like this:

new MultivariateOnlineSummarizer in org.apache.spark.ml.stat
{code}
class MultivariateOnlineSummarizer(private var metrics: Seq[String]) extends 
UserDefinedAggregateFunction {
def setMetrics(metrics: Seq[String]) = ...
def setMetrics(metric: String, others: String*) = ...
override def inputSchema: StructType = new StructType().add("weight", 
DoubleType).add("features", new VectorUDT)
override def bufferSchema: StructType = ...
override def dataType: DataType = DataTypes.createMapType(StringType, new 
VectorUDT)
override def deterministic: Boolean = true
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = 
...
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = 
...
override def evaluate(buffer: Row): Map[String,Vector] = ...
}
{code}

usage examples:
{code}
// in MinMaxScaler
val maxAbsAgg = new MultivariateOnlineSummarizer().setMetrics("min", "max")
val summary = dataset.groupBy().agg(maxAbsAgg(col("features"), col("weight")))
summary.show
+-+
|multivariateonlinesummarizer(features, weight)|
+-+
| Map(min -> [1.0,0...|
+-+
summary.first
res2: org.apache.spark.sql.Row = [Map(min -> [1.0,0.2,0.2], max -> 
[1.0,0.2,0.2])]
val result = summary.first.getAs[Map[String,Vector]](0)
val min: Vector = result("min")
val max: Vector = result("max")

// in LinearRegression
val featuresAgg = new MultivariateOnlineSummarizer().setMetrics("mean", 
"variance")
val labelAgg = new MultivariateOnlineSummarizer().setMetrics("mean", "variance")

val result = dataset.map{...}.groupBy().agg(featuresAgg(col("features"), 
col("weight")), labelAgg(col("labelVec"), col("weight"))).first
val featuresMetrics = result.getAs[Map[String,Vector]](0)
val labelMetrics = result.getAs[Map[String,Vector]](1)
{code}

I have not found a way to output multi columns in udaf, so I use 
{{Map[String,Vector]}} as the output type temporarily. If there is some way, 
I'll be happy to modify this place.

> MultivariateOnlineSummarizer perfermence optimization
> 

[jira] [Created] (SPARK-19328) using spark thrift server gets memory leak problem in `ExecutorsListener`

2017-01-22 Thread roncenzhao (JIRA)
roncenzhao created SPARK-19328:
--

 Summary: using spark thrift server gets memory leak problem in 
`ExecutorsListener`
 Key: SPARK-19328
 URL: https://issues.apache.org/jira/browse/SPARK-19328
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.2
 Environment: spark2.0.2
Reporter: roncenzhao


When I use spark-thrift-server, the memory usage will gradually increase. 
I found that in `ExecutorsListener` all the data about executors is never 
trimmed.

So do we need some operations to trim the data ?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer perfermence optimization

2017-01-22 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833345#comment-15833345
 ] 

zhengruifeng edited comment on SPARK-19208 at 1/22/17 8:00 AM:
---

After diving into sparksql's udaf, I design the new api like this:

new MultivariateOnlineSummarizer in org.apache.spark.ml.stat
{code}
class MultivariateOnlineSummarizer(private var metrics: Seq[String]) extends 
UserDefinedAggregateFunction {
def setMetrics(metrics: Seq[String]) = ...
def setMetrics(metric: String, others: String*) = ...
override def inputSchema: StructType = new StructType().add("weight", 
DoubleType).add("features", new VectorUDT)
override def bufferSchema: StructType = ...
override def dataType: DataType = DataTypes.createMapType(StringType, new 
VectorUDT)
override def deterministic: Boolean = true
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = 
...
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = 
...
override def evaluate(buffer: Row): Map[String. Vector] = ...
}
{code}

usage examples:
{code}
// in MinMaxScaler
val maxAbsAgg = new MultivariateOnlineSummarizer().setMetrics("min", "max")
val summary = dataset.groupBy().agg(maxAbsAgg(col("features"), col("weight")))
summary.show
+-+
|multivariateonlinesummarizer(features, weight)|
+-+
| Map(min -> [1.0,0...|
+-+
summary.first
res2: org.apache.spark.sql.Row = [Map(min -> [1.0,0.2,0.2], max -> 
[1.0,0.2,0.2])]
val result = summary.first.getAs[Map[String,Vector]](0)
val min: Vector = result("min")
val max: Vector = result("max")

// in LinearRegression
val featuresAgg = new MultivariateOnlineSummarizer().setMetrics("mean", 
"variance")
val labelAgg = new MultivariateOnlineSummarizer().setMetrics("mean", "variance")

val result = dataset.map{...}.groupBy().agg(featuresAgg(col("features"), 
col("weight")), labelAgg(col("labelVec"), col("weight"))).first
val featuresMetrics = result.getAs[Map[String,Vector]](0)
val labelMetrics = result.getAs[Map[String,Vector]](1)
{code}

I have not found a way to output multi columns in udaf, so I use 
{{Map[String,Vector]}} as the output type temporarily. If there is some way, 
I'll be happy to modify this place.


was (Author: podongfeng):
After diving into sparksql's udaf, I design the new api like this:

new MultivariateOnlineSummarizer in org.apache.spark.ml.stat
{code}
class MultivariateOnlineSummarizer(private var metrics: Seq[String]) extends 
UserDefinedAggregateFunction {
def setMetrics(metrics: Seq[String]) = ...
def setMetrics(metric: String, others: String*) = ...
override def inputSchema: StructType = new StructType().add("weight", 
DoubleType).add("features", new VectorUDT)
override def bufferSchema: StructType = ...
override def dataType: DataType = DataTypes.createMapType(StringType, new 
VectorUDT)
override def deterministic: Boolean = true
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = 
...
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = 
...
override def evaluate(buffer: Row): Vector = ...
}
{code}

usage examples:
{code}
// in MinMaxScaler
val maxAbsAgg = new MultivariateOnlineSummarizer().setMetrics("min", "max")
val summary = dataset.groupBy().agg(maxAbsAgg(col("features"), col("weight")))
summary.show
+-+
|multivariateonlinesummarizer(features, weight)|
+-+
| Map(min -> [1.0,0...|
+-+
summary.first
res2: org.apache.spark.sql.Row = [Map(min -> [1.0,0.2,0.2], max -> 
[1.0,0.2,0.2])]
val result = summary.first.getAs[Map[String,Vector]](0)
val min: Vector = result("min")
val max: Vector = result("max")

// in LinearRegression
val featuresAgg = new MultivariateOnlineSummarizer().setMetrics("mean", 
"variance")
val labelAgg = new MultivariateOnlineSummarizer().setMetrics("mean", "variance")

val result = dataset.map{...}.groupBy().agg(featuresAgg(col("features"), 
col("weight")), labelAgg(col("labelVec"), col("weight"))).first
val featuresMetrics = result.getAs[Map[String,Vector]](0)
val labelMetrics = result.getAs[Map[String,Vector]](1)
{code}

I have not found a way to output multi columns in udaf, so I use 
{{Map[String,Vector]}} as the output type temporarily. If there is some way, 
I'll be happy to modify this place.

> MultivariateOnlineSummarizer perfermence optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} 

[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer perfermence optimization

2017-01-22 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833345#comment-15833345
 ] 

zhengruifeng edited comment on SPARK-19208 at 1/22/17 8:01 AM:
---

After diving into sparksql's udaf, I design the new api like this:

new MultivariateOnlineSummarizer in org.apache.spark.ml.stat
{code}
class MultivariateOnlineSummarizer(private var metrics: Seq[String])
  extends UserDefinedAggregateFunction {
  def setMetrics(metrics: Seq[String]) = ...
  def setMetrics(metric: String, others: String*) = ...
  override def inputSchema: StructType =
    new StructType().add("weight", DoubleType).add("features", new VectorUDT)
  override def bufferSchema: StructType = ...
  override def dataType: DataType = DataTypes.createMapType(StringType, new VectorUDT)
  override def deterministic: Boolean = true
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = ...
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = ...
  override def evaluate(buffer: Row): Map[String,Vector] = ...
}
{code}

usage examples:
{code}
// in MinMaxScaler
val maxAbsAgg = new MultivariateOnlineSummarizer().setMetrics("min", "max")
val summary = dataset.groupBy().agg(maxAbsAgg(col("features"), col("weight")))
summary.show
+----------------------------------------------+
|multivariateonlinesummarizer(features, weight)|
+----------------------------------------------+
|                          Map(min -> [1.0,0...|
+----------------------------------------------+
summary.first
res2: org.apache.spark.sql.Row = [Map(min -> [1.0,0.2,0.2], max -> [1.0,0.2,0.2])]
val result = summary.first.getAs[Map[String,Vector]](0)
val min: Vector = result("min")
val max: Vector = result("max")

// in LinearRegression
val featuresAgg = new MultivariateOnlineSummarizer().setMetrics("mean", "variance")
val labelAgg = new MultivariateOnlineSummarizer().setMetrics("mean", "variance")

val result = dataset.map{...}.groupBy()
  .agg(featuresAgg(col("features"), col("weight")), labelAgg(col("labelVec"), col("weight")))
  .first
val featuresMetrics = result.getAs[Map[String,Vector]](0)
val labelMetrics = result.getAs[Map[String,Vector]](1)
{code}

I have not found a way to output multiple columns from a UDAF, so I use 
{{Map[String,Vector]}} as the output type for now. If there is a way to do it, 
I'll be happy to change this.
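
For reference, here is a minimal self-contained sketch of what the {{min}}/{{max}} path of such a summarizer might look like with the {{Map[String,Vector]}} output (again just an illustration, not the actual patch; the {{MinMaxVectorSummarizer}} name, the fixed {{numFeatures}} constructor argument, and the use of {{SQLDataTypes.VectorType}} are my assumptions):
{code}
import org.apache.spark.ml.linalg.{Vector, Vectors, SQLDataTypes}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Only the "min"/"max" metrics, with a fixed feature dimension passed in up front.
// The weight column is kept in the input schema for symmetry with the proposed
// API even though min/max do not use it.
class MinMaxVectorSummarizer(numFeatures: Int) extends UserDefinedAggregateFunction {
  override def inputSchema: StructType =
    new StructType().add("weight", DoubleType).add("features", SQLDataTypes.VectorType)

  // running min/max kept as plain double arrays in the aggregation buffer
  override def bufferSchema: StructType =
    new StructType().add("min", ArrayType(DoubleType)).add("max", ArrayType(DoubleType))

  override def dataType: DataType = MapType(StringType, SQLDataTypes.VectorType)
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Array.fill(numFeatures)(Double.PositiveInfinity)
    buffer(1) = Array.fill(numFeatures)(Double.NegativeInfinity)
  }

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val v = input.getAs[Vector](1)
    val mins = buffer.getSeq[Double](0).toArray
    val maxs = buffer.getSeq[Double](1).toArray
    var i = 0
    while (i < numFeatures) {
      mins(i) = math.min(mins(i), v(i))
      maxs(i) = math.max(maxs(i), v(i))
      i += 1
    }
    buffer(0) = mins
    buffer(1) = maxs
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val mins1 = buffer1.getSeq[Double](0).toArray
    val maxs1 = buffer1.getSeq[Double](1).toArray
    val mins2 = buffer2.getSeq[Double](0)
    val maxs2 = buffer2.getSeq[Double](1)
    var i = 0
    while (i < numFeatures) {
      mins1(i) = math.min(mins1(i), mins2(i))
      maxs1(i) = math.max(maxs1(i), maxs2(i))
      i += 1
    }
    buffer1(0) = mins1
    buffer1(1) = maxs1
  }

  override def evaluate(buffer: Row): Any =
    Map(
      "min" -> Vectors.dense(buffer.getSeq[Double](0).toArray),
      "max" -> Vectors.dense(buffer.getSeq[Double](1).toArray))
}
{code}
Usage would then match the {{maxAbsAgg}} example above, with the feature dimension supplied to the constructor.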


was (Author: podongfeng):
After diving into Spark SQL's UDAFs, I designed the new API like this:

A new {{MultivariateOnlineSummarizer}} in {{org.apache.spark.ml.stat}}:
{code}
class MultivariateOnlineSummarizer(private var metrics: Seq[String])
  extends UserDefinedAggregateFunction {
  def setMetrics(metrics: Seq[String]) = ...
  def setMetrics(metric: String, others: String*) = ...
  override def inputSchema: StructType =
    new StructType().add("weight", DoubleType).add("features", new VectorUDT)
  override def bufferSchema: StructType = ...
  override def dataType: DataType = DataTypes.createMapType(StringType, new VectorUDT)
  override def deterministic: Boolean = true
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = ...
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = ...
  override def evaluate(buffer: Row): Map[String. Vector] = ...
}
{code}

usage examples:
{code}
// in MinMaxScaler
val maxAbsAgg = new MultivariateOnlineSummarizer().setMetrics("min", "max")
val summary = dataset.groupBy().agg(maxAbsAgg(col("features"), col("weight")))
summary.show
+----------------------------------------------+
|multivariateonlinesummarizer(features, weight)|
+----------------------------------------------+
|                          Map(min -> [1.0,0...|
+----------------------------------------------+
summary.first
res2: org.apache.spark.sql.Row = [Map(min -> [1.0,0.2,0.2], max -> [1.0,0.2,0.2])]
val result = summary.first.getAs[Map[String,Vector]](0)
val min: Vector = result("min")
val max: Vector = result("max")

// in LinearRegression
val featuresAgg = new MultivariateOnlineSummarizer().setMetrics("mean", "variance")
val labelAgg = new MultivariateOnlineSummarizer().setMetrics("mean", "variance")

val result = dataset.map{...}.groupBy()
  .agg(featuresAgg(col("features"), col("weight")), labelAgg(col("labelVec"), col("weight")))
  .first
val featuresMetrics = result.getAs[Map[String,Vector]](0)
val labelMetrics = result.getAs[Map[String,Vector]](1)
{code}

I have not found a way to output multiple columns from a UDAF, so I use 
{{Map[String,Vector]}} as the output type for now. If there is a way to do it, 
I'll be happy to change this.

> MultivariateOnlineSummarizer perfermence optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now,