[jira] [Commented] (SPARK-17868) Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS

2016-10-11 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15564751#comment-15564751
 ] 

Jiang Xingbo commented on SPARK-17868:
--

Yes, I'll be working on this. Thank you!

> Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS
> 
>
> Key: SPARK-17868
> URL: https://issues.apache.org/jira/browse/SPARK-17868
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>
> We generate bitmasks for grouping sets during the parsing process and use 
> these during analysis. These bitmasks are difficult to work with in practice 
> and have led to numerous bugs. I suggest that we remove them and use actual 
> sets instead; however, we would then need to generate the offsets for the 
> grouping_id.
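
For illustration, a minimal sketch (not Spark's actual implementation; the helper and names are hypothetical) of how a grouping_id can still be derived when a grouping set is kept as a real set of expressions rather than a bitmask: a bit is set for each GROUP BY expression that is absent from the grouping set.

{code}
// Hypothetical sketch: derive grouping_id from a Set instead of a parser-built bitmask.
// A bit is 1 when the corresponding GROUP BY expression is NOT in the grouping set.
def groupingId(groupByExprs: Seq[String], groupingSet: Set[String]): Long =
  groupByExprs.foldLeft(0L) { (id, expr) =>
    (id << 1) | (if (groupingSet.contains(expr)) 0L else 1L)
  }

// GROUP BY a, b GROUPING SETS ((a, b), (a), ())
// groupingId(Seq("a", "b"), Set("a", "b")) == 0  // binary 00
// groupingId(Seq("a", "b"), Set("a"))      == 1  // binary 01
// groupingId(Seq("a", "b"), Set.empty)     == 3  // binary 11
{code}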



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17825) Expose log likelihood of EM algorithm in mllib

2016-10-11 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15564750#comment-15564750
 ] 

zhengruifeng commented on SPARK-17825:
--

This JIRA seems to be a duplicate of [SPARK-14272]

> Expose log likelihood of EM algorithm in mllib
> --
>
> Key: SPARK-17825
> URL: https://issues.apache.org/jira/browse/SPARK-17825
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Lei Wang
>
> Users sometimes need to get the log likelihood of the EM algorithm.
> For example, one might use this value to choose an appropriate number of clusters.
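
Until such an API is exposed, a minimal sketch of how the total log likelihood can be computed from a trained RDD-based GaussianMixtureModel using only its public weights and gaussians (the model and data RDD are assumed); comparing this value across different k is one way to pick the number of clusters.

{code}
import org.apache.spark.mllib.clustering.GaussianMixtureModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch: total log likelihood = sum over points of log( sum_k w_k * N(x | mu_k, sigma_k) )
def logLikelihood(model: GaussianMixtureModel, data: RDD[Vector]): Double =
  data.map { x =>
    math.log(model.weights.zip(model.gaussians).map { case (w, g) => w * g.pdf(x) }.sum)
  }.sum()
{code}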



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17825) Expose log likelihood of EM algorithm in mllib

2016-10-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17825.
---
Resolution: Duplicate

> Expose log likelihood of EM algorithm in mllib
> --
>
> Key: SPARK-17825
> URL: https://issues.apache.org/jira/browse/SPARK-17825
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Lei Wang
>
> Users sometimes need to get the log likelihood of the EM algorithm.
> For example, one might use this value to choose an appropriate number of clusters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17864) Mark data type APIs as stable, rather than DeveloperApi

2016-10-11 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17864.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15426
[https://github.com/apache/spark/pull/15426]

> Mark data type APIs as stable, rather than DeveloperApi
> ---
>
> Key: SPARK-17864
> URL: https://issues.apache.org/jira/browse/SPARK-17864
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: releasenotes
> Fix For: 2.1.0
>
>
> The data type API has not been changed since Spark 1.3.0, and is ready for 
> graduation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15564786#comment-15564786
 ] 

Apache Spark commented on SPARK-17219:
--

User 'VinceShieh' has created a pull request for this issue:
https://github.com/apache/spark/pull/15428

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>Assignee: Vincent
> Fix For: 2.1.0
>
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached Titanic CSV data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either, because a bucket that is [NaN, -Inf] doesn't 
> make much sense. I'm not sure whether the NaN bucket should count toward numBins, 
> but I do think it should always be there, in case future data has nulls even 
> though the fit data did not. Thoughts?
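
For reference, a short sketch of the reported setup (the DataFrame df, the "age" column, and the 10 buckets are taken from the description above; this only reproduces the behaviour, it is not a fix):

{code}
import org.apache.spark.ml.feature.QuantileDiscretizer

// Replace nulls in "age" with NaN, as described in the report (df is assumed).
val withNaN = df.na.fill(Double.NaN, Seq("age"))

val discretizer = new QuantileDiscretizer()
  .setInputCol("age")
  .setOutputCol("ageBucket")
  .setNumBuckets(10)

// With NaNs present, the learned splits can contain NaN entries, e.g.
// [-Infinity, 15.0, ..., 48.0, NaN, NaN, Infinity]
val bucketizer = discretizer.fit(withNaN)
println(bucketizer.getSplits.mkString(", "))
{code}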



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17869) Connect to Amazon S3 using signature version 4 (only choice in Frankfurt)

2016-10-11 Thread Robin B (JIRA)
Robin B created SPARK-17869:
---

 Summary: Connect to Amazon S3 using signature version 4 (only 
choice in Frankfurt)
 Key: SPARK-17869
 URL: https://issues.apache.org/jira/browse/SPARK-17869
 Project: Spark
  Issue Type: Improvement
Affects Versions: 2.0.1, 2.0.0
 Environment: Mac OS X / Ubuntu
pyspark
hadoop-aws:2.7.3
aws-java-sdk:1.11.41
Reporter: Robin B


Connection fails with **400 Bad Request** for S3 in the Frankfurt region, where 
version 4 authentication is required to connect. 

This issue is somewhat related to 
https://issues.apache.org/jira/browse/HADOOP-13325, but the 
solution there (to include the endpoint explicitly) does nothing to ameliorate the 
problem.


sc._jsc.hadoopConfiguration().set('fs.s3a.impl','org.apache.hadoop.fs.s3native.NativeS3FileSystem')

sc._jsc.hadoopConfiguration().set('com.amazonaws.services.s3.enableV4','true')

sc.setSystemProperty('SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY','true')

sc._jsc.hadoopConfiguration().set('fs.s3a.endpoint','s3.eu-central-1.amazonaws.com')
sc._jsc.hadoopConfiguration().set('fs.s3a.awsAccessKeyId','ACCESS_KEY')
sc._jsc.hadoopConfiguration().set('fs.s3a.awsSecretAccessKey','SECRET_KEY')

df = spark.read.csv("s3a://BUCKET-NAME/filename.csv")

yields:

16/10/10 18:39:28 WARN DataSource: Error while looking for metadata 
directory.
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/pyspark/sql/readwriter.py",
 line 363, in csv
return 
self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
 line 933, in __call__
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/pyspark/sql/utils.py", 
line 63, in deco
return f(*a, **kw)
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
 line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.csv.
: java.io.IOException: s3n://BUCKET-NAME : 400 : Bad Request
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:453)
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at 
org.apache.hadoop.fs.s3native.$Proxy7.retrieveMetadata(Unknown Source)
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:476)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:360)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:350)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
at 
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at 
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:401)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.re

[jira] [Updated] (SPARK-17869) Connect to Amazon S3 using signature version 4 (only choice in Frankfurt)

2016-10-11 Thread Robin B (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin B updated SPARK-17869:

Description: 
Connection fails with **400 Bad Request** for S3 in the Frankfurt region, where 
version 4 authentication is required to connect. 

This issue is somewhat related to HADOOP-13325, but the solution there (to include 
the endpoint explicitly) does nothing to ameliorate the problem.


sc._jsc.hadoopConfiguration().set('fs.s3a.impl','org.apache.hadoop.fs.s3native.NativeS3FileSystem')

sc._jsc.hadoopConfiguration().set('com.amazonaws.services.s3.enableV4','true')

sc.setSystemProperty('SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY','true')

sc._jsc.hadoopConfiguration().set('fs.s3a.endpoint','s3.eu-central-1.amazonaws.com')
sc._jsc.hadoopConfiguration().set('fs.s3a.awsAccessKeyId','ACCESS_KEY')
sc._jsc.hadoopConfiguration().set('fs.s3a.awsSecretAccessKey','SECRET_KEY')

df = spark.read.csv("s3a://BUCKET-NAME/filename.csv")

yields:

16/10/10 18:39:28 WARN DataSource: Error while looking for metadata 
directory.
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/pyspark/sql/readwriter.py",
 line 363, in csv
return 
self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
 line 933, in __call__
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/pyspark/sql/utils.py", 
line 63, in deco
return f(*a, **kw)
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
 line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.csv.
: java.io.IOException: s3n://BUCKET-NAME : 400 : Bad Request
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:453)
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at 
org.apache.hadoop.fs.s3native.$Proxy7.retrieveMetadata(Unknown Source)
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:476)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:360)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:350)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
at 
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at 
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:401)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at 
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at 
py4j.commands.AbstractCommand.invok

[jira] [Commented] (SPARK-17869) Connect to Amazon S3 using signature version 4 (only choice in Frankfurt)

2016-10-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15564801#comment-15564801
 ] 

Sean Owen commented on SPARK-17869:
---

This isn't a Spark issue, right? It's an issue with the S3 config in your app or 
the S3 library.
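
For reference, a hedged sketch of the usual s3a-side configuration for a V4-only region (assuming a SparkContext sc and hadoop-aws on the classpath; the report above appears to mix s3a settings with s3n-style key names, which the s3a connector ignores):

{code}
// Standard s3a property names; not a verified fix for this particular setup.
val hc = sc.hadoopConfiguration
hc.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hc.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
hc.set("fs.s3a.access.key", "ACCESS_KEY")
hc.set("fs.s3a.secret.key", "SECRET_KEY")
// V4 signing has to be enabled in the AWS SDK before the S3 client is created, e.g.
// --conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
{code}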

> Connect to Amazon S3 using signature version 4 (only choice in Frankfurt)
> -
>
> Key: SPARK-17869
> URL: https://issues.apache.org/jira/browse/SPARK-17869
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.0.1
> Environment: Mac OS X / Ubuntu
> pyspark
> hadoop-aws:2.7.3
> aws-java-sdk:1.11.41
>Reporter: Robin B
>
> Connection fails with **400 Bad Request** for S3 in the Frankfurt region, where 
> version 4 authentication is required to connect. 
> This issue is somewhat related to HADOOP-13325, but the solution there (to include 
> the endpoint explicitly) does nothing to ameliorate the problem.
> 
> sc._jsc.hadoopConfiguration().set('fs.s3a.impl','org.apache.hadoop.fs.s3native.NativeS3FileSystem')
> 
> sc._jsc.hadoopConfiguration().set('com.amazonaws.services.s3.enableV4','true')
> 
> sc.setSystemProperty('SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY','true')
> 
> sc._jsc.hadoopConfiguration().set('fs.s3a.endpoint','s3.eu-central-1.amazonaws.com')
> sc._jsc.hadoopConfiguration().set('fs.s3a.awsAccessKeyId','ACCESS_KEY')
> 
> sc._jsc.hadoopConfiguration().set('fs.s3a.awsSecretAccessKey','SECRET_KEY')
> df = spark.read.csv("s3a://BUCKET-NAME/filename.csv")
> yields:
>   16/10/10 18:39:28 WARN DataSource: Error while looking for metadata 
> directory.
>   Traceback (most recent call last):
> File "", line 1, in 
> File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/pyspark/sql/readwriter.py",
>  line 363, in csv
>   return 
> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
> File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
> File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/pyspark/sql/utils.py", 
> line 63, in deco
>   return f(*a, **kw)
> File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
>   py4j.protocol.Py4JJavaError: An error occurred while calling o35.csv.
>   : java.io.IOException: s3n://BUCKET-NAME : 400 : Bad Request
>   at 
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:453)
>   at 
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
>   at 
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
>   at 
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at 
> org.apache.hadoop.fs.s3native.$Proxy7.retrieveMetadata(Unknown Source)
>   at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:476)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:360)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:350)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
>   at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>  

[jira] [Commented] (SPARK-17840) Add some pointers for wiki/CONTRIBUTING.md in README.md and some warnings in PULL_REQUEST_TEMPLATE

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15564814#comment-15564814
 ] 

Apache Spark commented on SPARK-17840:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15429

> Add some pointers for wiki/CONTRIBUTING.md in README.md and some warnings in 
> PULL_REQUEST_TEMPLATE
> --
>
> Key: SPARK-17840
> URL: https://issues.apache.org/jira/browse/SPARK-17840
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> It seems it would be better to make a best effort at describing the guidelines for 
> creating JIRAs and PRs, as there have been several discussions about this.
> Maybe,
>  - We could add some pointers to {{CONTRIBUTING.md}} or the wiki[1] in README.md 
>  - Add an explicit warning in PULL_REQUEST_TEMPLATE[2], in particular for 
> PRs that try to merge a branch into another branch, for example as below:
> {quote}
> Please double check whether your pull request is from a branch to a branch. It is 
> probably wrong if the pull request title is "Branch 2.0" or "Merge ..." or if 
> your pull request includes many commit logs unrelated to the changes you 
> propose. 
> {quote}
> Please refer to the mailing list[3].
> [1] https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> [2] https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
> [3] 
> http://mail-archives.apache.org/mod_mbox/spark-dev/201610.mbox/%3CCAMAsSdKYEvvU%2BPdZ0oFDsvwod6wRRavqRyYmjc7qYhEp7%2B_3eg%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17840) Add some pointers for wiki/CONTRIBUTING.md in README.md and some warnings in PULL_REQUEST_TEMPLATE

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17840:


Assignee: (was: Apache Spark)

> Add some pointers for wiki/CONTRIBUTING.md in README.md and some warnings in 
> PULL_REQUEST_TEMPLATE
> --
>
> Key: SPARK-17840
> URL: https://issues.apache.org/jira/browse/SPARK-17840
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> It seems it would be better to make a best effort at describing the guidelines for 
> creating JIRAs and PRs, as there have been several discussions about this.
> Maybe,
>  - We could add some pointers to {{CONTRIBUTING.md}} or the wiki[1] in README.md 
>  - Add an explicit warning in PULL_REQUEST_TEMPLATE[2], in particular for 
> PRs that try to merge a branch into another branch, for example as below:
> {quote}
> Please double check whether your pull request is from a branch to a branch. It is 
> probably wrong if the pull request title is "Branch 2.0" or "Merge ..." or if 
> your pull request includes many commit logs unrelated to the changes you 
> propose. 
> {quote}
> Please refer to the mailing list[3].
> [1] https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> [2] https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
> [3] 
> http://mail-archives.apache.org/mod_mbox/spark-dev/201610.mbox/%3CCAMAsSdKYEvvU%2BPdZ0oFDsvwod6wRRavqRyYmjc7qYhEp7%2B_3eg%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17840) Add some pointers for wiki/CONTRIBUTING.md in README.md and some warnings in PULL_REQUEST_TEMPLATE

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17840:


Assignee: Apache Spark

> Add some pointers for wiki/CONTRIBUTING.md in README.md and some warnings in 
> PULL_REQUEST_TEMPLATE
> --
>
> Key: SPARK-17840
> URL: https://issues.apache.org/jira/browse/SPARK-17840
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Trivial
>
> It seems it would be better to make a best effort at describing the guidelines for 
> creating JIRAs and PRs, as there have been several discussions about this.
> Maybe,
>  - We could add some pointers to {{CONTRIBUTING.md}} or the wiki[1] in README.md 
>  - Add an explicit warning in PULL_REQUEST_TEMPLATE[2], in particular for 
> PRs that try to merge a branch into another branch, for example as below:
> {quote}
> Please double check whether your pull request is from a branch to a branch. It is 
> probably wrong if the pull request title is "Branch 2.0" or "Merge ..." or if 
> your pull request includes many commit logs unrelated to the changes you 
> propose. 
> {quote}
> Please refer to the mailing list[3].
> [1] https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> [2] https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
> [3] 
> http://mail-archives.apache.org/mod_mbox/spark-dev/201610.mbox/%3CCAMAsSdKYEvvU%2BPdZ0oFDsvwod6wRRavqRyYmjc7qYhEp7%2B_3eg%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17869) Connect to Amazon S3 using signature version 4 (only choice in Frankfurt)

2016-10-11 Thread Robin B (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin B closed SPARK-17869.
---
Resolution: Won't Fix

You are right [~srowen]

> Connect to Amazon S3 using signature version 4 (only choice in Frankfurt)
> -
>
> Key: SPARK-17869
> URL: https://issues.apache.org/jira/browse/SPARK-17869
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.0.1
> Environment: Mac OS X / Ubuntu
> pyspark
> hadoop-aws:2.7.3
> aws-java-sdk:1.11.41
>Reporter: Robin B
>
> Connection fails with **400 Bad Request** for S3 in the Frankfurt region, where 
> version 4 authentication is required to connect. 
> This issue is somewhat related to HADOOP-13325, but the solution there (to include 
> the endpoint explicitly) does nothing to ameliorate the problem.
> 
> sc._jsc.hadoopConfiguration().set('fs.s3a.impl','org.apache.hadoop.fs.s3native.NativeS3FileSystem')
> 
> sc._jsc.hadoopConfiguration().set('com.amazonaws.services.s3.enableV4','true')
> 
> sc.setSystemProperty('SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY','true')
> 
> sc._jsc.hadoopConfiguration().set('fs.s3a.endpoint','s3.eu-central-1.amazonaws.com')
> sc._jsc.hadoopConfiguration().set('fs.s3a.awsAccessKeyId','ACCESS_KEY')
> 
> sc._jsc.hadoopConfiguration().set('fs.s3a.awsSecretAccessKey','SECRET_KEY')
> df = spark.read.csv("s3a://BUCKET-NAME/filename.csv")
> yields:
>   16/10/10 18:39:28 WARN DataSource: Error while looking for metadata 
> directory.
>   Traceback (most recent call last):
> File "", line 1, in 
> File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/pyspark/sql/readwriter.py",
>  line 363, in csv
>   return 
> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
> File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
> File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/pyspark/sql/utils.py", 
> line 63, in deco
>   return f(*a, **kw)
> File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
>   py4j.protocol.Py4JJavaError: An error occurred while calling o35.csv.
>   : java.io.IOException: s3n://BUCKET-NAME : 400 : Bad Request
>   at 
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:453)
>   at 
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
>   at 
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
>   at 
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at 
> org.apache.hadoop.fs.s3native.$Proxy7.retrieveMetadata(Unknown Source)
>   at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:476)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:360)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:350)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
>   at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at 
> org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:401)
>   at sun.r

[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-10-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15564828#comment-15564828
 ] 

Sean Owen commented on SPARK-17219:
---

Yeah, unless you return some complex object with normal buckets and a specially 
separated "other" bucket, returning all the buckets together implies some 
kind of ordering, even if you need not _necessarily_ think of the buckets as 
ordered. They're just counts. It does open this up to misinterpretation by 
careless callers.

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>Assignee: Vincent
> Fix For: 2.1.0
>
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached Titanic CSV data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either, because a bucket that is [NaN, -Inf] doesn't 
> make much sense. I'm not sure whether the NaN bucket should count toward numBins, 
> but I do think it should always be there, in case future data has nulls even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17821) Expression Canonicalization should support Add and Or

2016-10-11 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17821:

Assignee: Liang-Chi Hsieh

> Expression Canonicalization should support Add and Or
> -
>
> Key: SPARK-17821
> URL: https://issues.apache.org/jira/browse/SPARK-17821
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> Currently the {{Canonicalize}} object doesn't support {{And}} and {{Or}}, so we 
> cannot compare the canonicalized forms of predicates consistently. We should add 
> that support.
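
As a rough illustration only (not Spark's {{Canonicalize}}), reordering the children of a commutative operator by a deterministic key is what makes semantically equal expressions compare equal after canonicalization:

{code}
// Toy model of canonicalizing a commutative binary operator by sorting its operands.
case class Add(left: String, right: String)

def canonicalize(e: Add): Add = {
  val Seq(l, r) = Seq(e.left, e.right).sorted
  Add(l, r)
}

// "a + b" and "b + a" now canonicalize to the same expression.
assert(canonicalize(Add("a", "b")) == canonicalize(Add("b", "a")))
{code}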



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17821) Expression Canonicalization should support Add and Or

2016-10-11 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17821.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15388
[https://github.com/apache/spark/pull/15388]

> Expression Canonicalization should support Add and Or
> -
>
> Key: SPARK-17821
> URL: https://issues.apache.org/jira/browse/SPARK-17821
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> Currently the {{Canonicalize}} object doesn't support {{And}} and {{Or}}, so we 
> cannot compare the canonicalized forms of predicates consistently. We should add 
> that support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15957) RFormula supports forcing to index label

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15564876#comment-15564876
 ] 

Apache Spark commented on SPARK-15957:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/15430

> RFormula supports forcing to index label
> 
>
> Key: SPARK-15957
> URL: https://issues.apache.org/jira/browse/SPARK-15957
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.1.0
>
>
> Currently RFormula indexes the label only when it is of string type. If the label 
> is of numeric type and we use RFormula to fit a classification model, there 
> are no label attributes in the label column metadata. The label attributes are 
> useful when making predictions for classification, so we can force label indexing 
> via {{StringIndexer}} whether the label is numeric or string for 
> classification. Then SparkR wrappers can extract label attributes from the label 
> column metadata successfully. This feature can help us fix bugs similar 
> to SPARK-15153.
> For regression, we will still keep the label as numeric type.
> In this PR, we add a param indexLabel to control whether to force label indexing 
> for RFormula.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17854) it will failed when do select rand(null)

2016-10-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15564884#comment-15564884
 ] 

Hyukjin Kwon commented on SPARK-17854:
--

It seems this can be quickly fixed. Please let me submit a PR.

> it will failed when do select rand(null)
> 
>
> Key: SPARK-17854
> URL: https://issues.apache.org/jira/browse/SPARK-17854
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark version:2.0.0
> Linuxversion: Suse 11.3
>Reporter: chenerlu
>Priority: Minor
>
> In Spark 2.0, when running spark-sql and trying to do select rand(null), it fails 
> with the following error message:
> > select rand(null);
> INFO SparkSqlParser: Parsing command: select rand(null)
> Error in query: Input argument to rand must be an integer literal.;; line 1 
> pos 7
> In Hive or MySQL, the same operation (select rand(null)) succeeds, and the value 
> of rand(null) is the same as rand(0).
> This issue has been fixed in Hive; the JIRA link is as follows:
> https://issues.apache.org/jira/browse/HIVE-14694



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17870) ML/MLLIB: Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Peng Meng (JIRA)
Peng Meng created SPARK-17870:
-

 Summary: ML/MLLIB: Statistics.chiSqTest(RDD) is wrong 
 Key: SPARK-17870
 URL: https://issues.apache.org/jira/browse/SPARK-17870
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Reporter: Peng Meng
Priority: Critical


The method used to compute the chi-squared test results in 
mllib/feature/ChiSqSelector.scala (line 233) is wrong.

The feature selection method ChiSqSelector selects features based on the 
chi-squared test statistic (ChiSqTestResult.statistic): it keeps the features 
with the largest chi-square values. But the degrees of freedom (df) behind those 
chi-square values differ across features in Statistics.chiSqTest(RDD), and with 
different df you cannot select features based on the raw chi-square value.

Because of this wrong way of computing the chi-square value, the feature selection 
results are strange.
Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
If selectKBest is used: feature 3 is selected.
If selectFpr is used: features 1 and 2 are selected.
This is strange.

I used scikit-learn to test the same data with the same parameters.
With selectKBest: feature 1 is selected.
With selectFpr: features 1 and 2 are selected.
This result makes sense, because the df of each feature in scikit-learn is the same.

I plan to submit a PR for this problem.
 

 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17784) Add fromCenters method for KMeans

2016-10-11 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15564946#comment-15564946
 ] 

Nick Pentreath commented on SPARK-17784:


It's actually to create a new `KMeans` estimator I believe

> Add fromCenters method for KMeans
> -
>
> Key: SPARK-17784
> URL: https://issues.apache.org/jira/browse/SPARK-17784
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Xusen Yin
>Priority: Minor
>
> Add a new factory method fromCenters(centers: Array[Vector]) for KMeans.
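
For context, the RDD-based mllib API already supports warm starts via setInitialModel; a minimal sketch (the centers array and data RDD are assumed), with the proposal essentially adding a similar convenience on the ml estimator:

{code}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}

// centers: Array[Vector] from a previous run, data: RDD[Vector] (both assumed).
val initialModel = new KMeansModel(centers)
val model = new KMeans()
  .setK(centers.length)
  .setInitialModel(initialModel)
  .run(data)
{code}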



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17784) Add fromCenters method for KMeans

2016-10-11 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15564946#comment-15564946
 ] 

Nick Pentreath edited comment on SPARK-17784 at 10/11/16 8:59 AM:
--

It's actually to create a new `KMeans` estimator I believe - for warm-starting 
training from a previous set of centers.


was (Author: mlnick):
It's actually to create a new `KMeans` estimator I believe

> Add fromCenters method for KMeans
> -
>
> Key: SPARK-17784
> URL: https://issues.apache.org/jira/browse/SPARK-17784
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Xusen Yin
>Priority: Minor
>
> Add a new factory method fromCenters(centers: Array[Vector]) for KMeans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15564959#comment-15564959
 ] 

Sean Owen commented on SPARK-17870:
---

Oof, I'm pretty certain you're correct. You can rank on the p-value (which is a 
function of DoF) but not the raw statistic. It's an easy change at least 
because this is already computed. Can't believe I missed that.
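
A minimal sketch of what ranking on the p-value (rather than the raw statistic) could look like, using only the existing public API (an RDD[LabeledPoint] named data is assumed; this is not the proposed fix itself):

{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// Rank features by p-value, which already accounts for each feature's degrees of freedom.
def selectKBest(data: RDD[LabeledPoint], k: Int): Array[Int] =
  Statistics.chiSqTest(data)               // one ChiSqTestResult per feature
    .zipWithIndex
    .sortBy { case (result, _) => result.pValue }
    .take(k)
    .map { case (_, index) => index }
{code}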

> ML/MLLIB: Statistics.chiSqTest(RDD) is wrong 
> -
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method used to compute the chi-squared test results in 
> mllib/feature/ChiSqSelector.scala (line 233) is wrong.
> The feature selection method ChiSqSelector selects features based on the 
> chi-squared test statistic (ChiSqTestResult.statistic): it keeps the features 
> with the largest chi-square values. But the degrees of freedom (df) behind those 
> chi-square values differ across features in Statistics.chiSqTest(RDD), and with 
> different df you cannot select features based on the raw chi-square value.
> Because of this wrong way of computing the chi-square value, the feature selection 
> results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If selectKBest is used: feature 3 is selected.
> If selectFpr is used: features 1 and 2 are selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters.
> With selectKBest: feature 1 is selected.
> With selectFpr: features 1 and 2 are selected.
> This result makes sense, because the df of each feature in scikit-learn is the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14272) Evaluate GaussianMixtureModel with LogLikelihood

2016-10-11 Thread Lei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15564986#comment-15564986
 ] 

Lei Wang commented on SPARK-14272:
--

Is this still in progress? 

> Evaluate GaussianMixtureModel with LogLikelihood
> 
>
> Key: SPARK-14272
> URL: https://issues.apache.org/jira/browse/SPARK-14272
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: zhengruifeng
>Priority: Minor
>
> GMM uses EM to maximize the likelihood of the data, so the likelihood can be a 
> useful metric for evaluating a GaussianMixtureModel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15153) SparkR spark.naiveBayes throws error when label is numeric type

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565004#comment-15565004
 ] 

Apache Spark commented on SPARK-15153:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/15431

> SparkR spark.naiveBayes throws error when label is numeric type
> ---
>
> Key: SPARK-15153
> URL: https://issues.apache.org/jira/browse/SPARK-15153
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> When the label of the dataset is of numeric type, SparkR spark.naiveBayes will throw 
> an error. This bug is easy to reproduce:
> {code}
> t <- as.data.frame(Titanic)
> t1 <- t[t$Freq > 0, -5]
> t1$NumericSurvived <- ifelse(t1$Survived == "No", 0, 1)
> t2 <- t1[-4]
> df <- suppressWarnings(createDataFrame(sqlContext, t2))
> m <- spark.naiveBayes(df, NumericSurvived ~ .)
> 16/05/05 03:26:17 ERROR RBackendHandler: fit on 
> org.apache.spark.ml.r.NaiveBayesWrapper failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.ClassCastException: 
> org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to 
> org.apache.spark.ml.attribute.NominalAttribute
>   at 
> org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:66)
>   at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at io.netty.channel.AbstractChannelHandlerContext.invo
> {code}
> In RFormula, the response variable type can be string or numeric. If it's 
> a string, RFormula will transform it into a DoubleType label via StringIndexer 
> and set the corresponding column metadata; otherwise, RFormula will use it 
> directly as the label when training the model (and assumes that it is numbered 0, 
> ..., maxLabelIndex). 
> When we extract labels in ml.r.NaiveBayesWrapper, we should handle it 
> according to the type of the response variable (string or numeric).
> cc [~mengxr] [~josephkb]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17854) it will failed when do select rand(null)

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17854:


Assignee: Apache Spark

> it will failed when do select rand(null)
> 
>
> Key: SPARK-17854
> URL: https://issues.apache.org/jira/browse/SPARK-17854
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark version:2.0.0
> Linuxversion: Suse 11.3
>Reporter: chenerlu
>Assignee: Apache Spark
>Priority: Minor
>
> In Spark 2.0, when running spark-sql and trying to do select rand(null), it fails 
> with the following error message:
> > select rand(null);
> INFO SparkSqlParser: Parsing command: select rand(null)
> Error in query: Input argument to rand must be an integer literal.;; line 1 
> pos 7
> In Hive or MySQL, the same operation (select rand(null)) succeeds, and the value 
> of rand(null) is the same as rand(0).
> This issue has been fixed in Hive; the JIRA link is as follows:
> https://issues.apache.org/jira/browse/HIVE-14694



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17854) it will failed when do select rand(null)

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17854:


Assignee: (was: Apache Spark)

> it will failed when do select rand(null)
> 
>
> Key: SPARK-17854
> URL: https://issues.apache.org/jira/browse/SPARK-17854
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark version:2.0.0
> Linuxversion: Suse 11.3
>Reporter: chenerlu
>Priority: Minor
>
> In Spark 2.0, when running spark-sql and trying to do select rand(null), it fails 
> with the following error message:
> > select rand(null);
> INFO SparkSqlParser: Parsing command: select rand(null)
> Error in query: Input argument to rand must be an integer literal.;; line 1 
> pos 7
> In Hive or MySQL, the same operation (select rand(null)) succeeds, and the value 
> of rand(null) is the same as rand(0).
> This issue has been fixed in Hive; the JIRA link is as follows:
> https://issues.apache.org/jira/browse/HIVE-14694



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17854) it will failed when do select rand(null)

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565015#comment-15565015
 ] 

Apache Spark commented on SPARK-17854:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15432

> it will failed when do select rand(null)
> 
>
> Key: SPARK-17854
> URL: https://issues.apache.org/jira/browse/SPARK-17854
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark version:2.0.0
> Linuxversion: Suse 11.3
>Reporter: chenerlu
>Priority: Minor
>
> In Spark 2.0, when running spark-sql and trying to do select rand(null), it fails 
> with the following error message:
> > select rand(null);
> INFO SparkSqlParser: Parsing command: select rand(null)
> Error in query: Input argument to rand must be an integer literal.;; line 1 
> pos 7
> In Hive or MySQL, the same operation (select rand(null)) succeeds, and the value 
> of rand(null) is the same as rand(0).
> This issue has been fixed in Hive; the JIRA link is as follows:
> https://issues.apache.org/jira/browse/HIVE-14694



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565041#comment-15565041
 ] 

Peng Meng commented on SPARK-17870:
---

Hi [~srowen], thanks very much for your quick reply. 
Yes, the p-value is better than the raw statistic in this case, because the p-value 
is computed from the DoF and the raw statistic.
The raw statistic is also popular for feature selection: SelectKBest and 
SelectPercentile in scikit-learn are based on the raw statistic. 
The point here is that we should use the same DoF as scikit-learn when computing 
the chi-square value. 
For this JIRA, I propose changing the method that computes the chi-square value to 
match what scikit-learn does (i.e. change Statistics.chiSqTest(RDD)). 

Thanks very much.

> ML/MLLIB: Statistics.chiSqTest(RDD) is wrong 
> -
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method used to compute the chi-squared test results in 
> mllib/feature/ChiSqSelector.scala (line 233) is wrong.
> The feature selection method ChiSqSelector selects features based on the 
> chi-squared test statistic (ChiSqTestResult.statistic): it keeps the features 
> with the largest chi-square values. But the degrees of freedom (df) behind those 
> chi-square values differ across features in Statistics.chiSqTest(RDD), and with 
> different df you cannot select features based on the raw chi-square value.
> Because of this wrong way of computing the chi-square value, the feature selection 
> results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If selectKBest is used: feature 3 is selected.
> If selectFpr is used: features 1 and 2 are selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters.
> With selectKBest: feature 1 is selected.
> With selectFpr: features 1 and 2 are selected.
> This result makes sense, because the df of each feature in scikit-learn is the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14272) Evaluate GaussianMixtureModel with LogLikelihood

2016-10-11 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565069#comment-15565069
 ] 

zhengruifeng edited comment on SPARK-14272 at 10/11/16 10:07 AM:
-

Yes, I will make an update after SPARK-17847 gets merged


was (Author: podongfeng):
Yes, I will a update after SPARK-17847 get merged

> Evaluate GaussianMixtureModel with LogLikelihood
> 
>
> Key: SPARK-14272
> URL: https://issues.apache.org/jira/browse/SPARK-14272
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: zhengruifeng
>Priority: Minor
>
> GMM uses EM to maximize the likelihood of the data, so the likelihood can be a 
> useful metric for evaluating a GaussianMixtureModel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14272) Evaluate GaussianMixtureModel with LogLikelihood

2016-10-11 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565069#comment-15565069
 ] 

zhengruifeng commented on SPARK-14272:
--

Yes, I will make an update after SPARK-17847 gets merged

> Evaluate GaussianMixtureModel with LogLikelihood
> 
>
> Key: SPARK-14272
> URL: https://issues.apache.org/jira/browse/SPARK-14272
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: zhengruifeng
>Priority: Minor
>
> GMM uses EM to maximize the likelihood of the data, so the likelihood can be a 
> useful metric for evaluating a GaussianMixtureModel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14272) Evaluate GaussianMixtureModel with LogLikelihood

2016-10-11 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-14272:
-
Component/s: (was: MLlib)
 ML

> Evaluate GaussianMixtureModel with LogLikelihood
> 
>
> Key: SPARK-14272
> URL: https://issues.apache.org/jira/browse/SPARK-14272
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> GMM uses EM to maximize the likelihood of the data, so the likelihood can be a 
> useful metric for evaluating a GaussianMixtureModel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565180#comment-15565180
 ] 

Sean Owen commented on SPARK-17870:
---

I don't think the raw statistic can be directly compared here, because the features do 
not necessarily have even nearly the same number of 'buckets'. A given test statistic 
value is "less remarkable" when there are more DoF; what's high for a binary-valued 
feature may not be high at all for one taking on 100 values.

Does scikit-learn really use the statistic? Because you're also saying it does something 
that gives different results from ranking on the statistic.

> ML/MLLIB: Statistics.chiSqTest(RDD) is wrong 
> -
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method used to compute the ChiSqTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSqSelector ranks features by ChiSqTestResult.statistic 
> (the chi-square value) and selects the features with the largest chi-square values. But 
> the degrees of freedom (df) of the chi-square values returned by 
> Statistics.chiSqTest(RDD) differ from feature to feature, and chi-square values with 
> different df cannot be compared directly to select features.
> Because of this wrong way of computing the chi-square values, the feature selection 
> results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> When using selectKBest: feature 3 will be selected.
> When using selectFpr: features 1 and 2 will be selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters:
> When using selectKBest: feature 1 will be selected.
> When using selectFpr: features 1 and 2 will be selected.
> This result makes sense, because the df of each feature in scikit-learn is the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17656) Decide on the variant of @scala.annotation.varargs and use consistently

2016-10-11 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-17656.
--
Resolution: Fixed

This was fixed in the PR together.

> Decide on the variant of @scala.annotation.varargs and use consistently
> ---
>
> Key: SPARK-17656
> URL: https://issues.apache.org/jira/browse/SPARK-17656
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Assignee: Reynold Xin
>Priority: Trivial
>
> After the [discussion at 
> dev@spark|http://apache-spark-developers-list.1001551.n3.nabble.com/scala-annotation-varargs-or-root-scala-annotation-varargs-td18898.html]
>  it appears there's a consensus to review the use of 
> {{@scala.annotation.varargs}} throughout the codebase and use one variant and 
> use it consistently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565225#comment-15565225
 ] 

Peng Meng commented on SPARK-17870:
---

Yes, SelectKBest and SelectPercentile in scikit-learn only use the statistic.
Because the method used to compute the chi-square values is different, the DoF of all 
features in scikit-learn is the same, so it can do that.

The chi-square computation works like this. Suppose we have the data from 
ml/feature/ChiSqSelectorSuite.scala:

X = [ 8  7  0
      0  9  6
      0  9  8
      8  9  5 ]
y = [ 0  1  1  2 ]^T

scikit-learn computes the chi-square values as follows. First, one-hot encode the labels:

Y = [ 1  0  0
      0  1  0
      0  1  0
      0  0  1 ]

observed = Y^T * X =
[ 8   7   0
  0  18  14
  8   9   5 ]

expected =
[ 4  8.5  4.75
  8  17   9.5
  4  8.5  4.75 ]

_chisquare(observed, expected) then computes the chi-square value of every feature, and 
all features end up with the same DoF.

But Spark's Statistics.chiSqTest(RDD) uses another method: it constructs a contingency 
table for each feature, so the DoF differs from feature to feature.

As for "gives different results from ranking on the statistic": that is because the 
parameters are different. For the example above, SelectKBest(2) selects the same 
features as SelectFpr(0.2) in scikit-learn.
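
To make the comparison concrete, here is a self-contained sketch (plain Scala, no Spark; the object name is illustrative) of the scikit-learn-style computation just described: observed = Y^T X, expected built from the class probabilities and per-feature totals, and one chi-square statistic per feature, all sharing df = numClasses - 1.

{code}
object SklearnStyleChi2 {
  def main(args: Array[String]): Unit = {
    // 4 samples x 3 features (the suite data above) and one class label per sample.
    val x = Array(Array(8.0, 7.0, 0.0), Array(0.0, 9.0, 6.0),
                  Array(0.0, 9.0, 8.0), Array(8.0, 9.0, 5.0))
    val y = Array(0, 1, 1, 2)

    val numClasses = y.max + 1
    val numFeatures = x.head.length
    val n = x.length

    // observed(c)(f): total of feature f over samples in class c (this is Y^T * X).
    val observed = Array.ofDim[Double](numClasses, numFeatures)
    for (i <- 0 until n; f <- 0 until numFeatures) observed(y(i))(f) += x(i)(f)

    val classProb = (0 until numClasses).map(c => y.count(_ == c).toDouble / n)
    val featureTotal = (0 until numFeatures).map(f => x.map(_(f)).sum)

    // chi2(f) = sum_c (observed - expected)^2 / expected, expected = P(class) * featureTotal.
    val chi2 = (0 until numFeatures).map { f =>
      (0 until numClasses).map { c =>
        val expected = classProb(c) * featureTotal(f)
        math.pow(observed(c)(f) - expected, 2) / expected
      }.sum
    }

    // Every feature shares df = numClasses - 1; for this data feature 1 gets the
    // largest statistic, matching the scikit-learn selectKBest result quoted above.
    println(chi2.mkString(", "))
  }
}
{code}

Because every feature shares the same df, ranking on the raw statistic and ranking on the p-value agree, which is why scikit-learn can sort on the statistic directly.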


 


> ML/MLLIB: Statistics.chiSqTest(RDD) is wrong 
> -
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method used to compute the ChiSqTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSqSelector ranks features by ChiSqTestResult.statistic 
> (the chi-square value) and selects the features with the largest chi-square values. But 
> the degrees of freedom (df) of the chi-square values returned by 
> Statistics.chiSqTest(RDD) differ from feature to feature, and chi-square values with 
> different df cannot be compared directly to select features.
> Because of this wrong way of computing the chi-square values, the feature selection 
> results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> When using selectKBest: feature 3 will be selected.
> When using selectFpr: features 1 and 2 will be selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters:
> When using selectKBest: feature 1 will be selected.
> When using selectFpr: features 1 and 2 will be selected.
> This result makes sense, because the df of each feature in scikit-learn is the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565238#comment-15565238
 ] 

Sean Owen commented on SPARK-17870:
---

I don't quite understand this example; can you point me to the source? The chi-squared 
statistic is indeed a function of observed and expected counts, but I'd expect those to 
be a vector of counts, one for each class. If you're saying that each row contains 
observed counts for one feature's classes, then yes, in this particular construction each 
of them has the same number of classes (columns). But that isn't generally true; that 
can't be an assumption scikit-learn makes? I bet I'm missing something.

> ML/MLLIB: Statistics.chiSqTest(RDD) is wrong 
> -
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method used to compute the ChiSqTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSqSelector ranks features by ChiSqTestResult.statistic 
> (the chi-square value) and selects the features with the largest chi-square values. But 
> the degrees of freedom (df) of the chi-square values returned by 
> Statistics.chiSqTest(RDD) differ from feature to feature, and chi-square values with 
> different df cannot be compared directly to select features.
> Because of this wrong way of computing the chi-square values, the feature selection 
> results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> When using selectKBest: feature 3 will be selected.
> When using selectFpr: features 1 and 2 will be selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters:
> When using selectKBest: feature 1 will be selected.
> When using selectFpr: features 1 and 2 will be selected.
> This result makes sense, because the df of each feature in scikit-learn is the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565251#comment-15565251
 ] 

Peng Meng commented on SPARK-17870:
---

The scikit-learn code is here: 
https://github.com/scikit-learn/scikit-learn/blob/412996f09b6756752dfd3736c306d46fca8f1aa1/sklearn/feature_selection/univariate_selection.py
(line 422 for SelectKBest; the chi-square computation is on the same page).

For the last example, each row of X is a sample containing three features; there are 
four samples in total, and y is the label vector.
Thanks very much.


> ML/MLLIB: Statistics.chiSqTest(RDD) is wrong 
> -
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method used to compute the ChiSqTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSqSelector ranks features by ChiSqTestResult.statistic 
> (the chi-square value) and selects the features with the largest chi-square values. But 
> the degrees of freedom (df) of the chi-square values returned by 
> Statistics.chiSqTest(RDD) differ from feature to feature, and chi-square values with 
> different df cannot be compared directly to select features.
> Because of this wrong way of computing the chi-square values, the feature selection 
> results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> When using selectKBest: feature 3 will be selected.
> When using selectFpr: features 1 and 2 will be selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters:
> When using selectKBest: feature 1 will be selected.
> When using selectFpr: features 1 and 2 will be selected.
> This result makes sense, because the df of each feature in scikit-learn is the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17853) Kafka OffsetOutOfRangeException on DStreams union from separate Kafka clusters with identical topic names.

2016-10-11 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565278#comment-15565278
 ] 

Cody Koeninger commented on SPARK-17853:


Which version of DStream are you using, 0-10 or 0-8?
Are you using the same group id for both streams?

> Kafka OffsetOutOfRangeException on DStreams union from separate Kafka 
> clusters with identical topic names.
> --
>
> Key: SPARK-17853
> URL: https://issues.apache.org/jira/browse/SPARK-17853
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Marcin Kuthan
>
> During migration from Spark 1.6 to 2.0 I observed OffsetOutOfRangeException  
> reported by Kafka client. In our scenario we create single DStream as a union 
> of multiple DStreams. One DStream for one Kafka cluster (multi dc solution). 
> Both Kafka clusters have the same topics and number of partitions.
> After quick investigation, I found that class DirectKafkaInputDStream keeps 
> offset state for topic and partitions, but it is not aware of different Kafka 
> clusters. 
> For every topic, single DStream is created as a union from all configured 
> Kafka clusters.
> {code}
> class KafkaDStreamSource(configs: Iterable[Map[String, String]]) {
> def createSource(ssc: StreamingContext, topic: String): DStream[(String, 
> Array[Byte])] = {
> val streams = configs.map { config =>
>   val kafkaParams = config
>   val kafkaTopics = Set(topic)
>   KafkaUtils.
>   createDirectStream[String, Array[Byte]](
> ssc,
> LocationStrategies.PreferConsistent,
> ConsumerStrategies.Subscribe[String, Array[Byte]](kafkaTopics, 
> kafkaParams)
>   ).map { record =>
> (record.key, record.value)
>   }
> }
> ssc.union(streams.toSeq)
>   }
> }
> {code}
> At the end, offsets from one Kafka cluster overwrite offsets from second one. 
> Fortunately OffsetOutOfRangeException was thrown because offsets in both 
> Kafka clusters are significantly different.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17853) Kafka OffsetOutOfRangeException on DStreams union from separate Kafka clusters with identical topic names.

2016-10-11 Thread Piotr Guzik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565307#comment-15565307
 ] 

Piotr Guzik commented on SPARK-17853:
-

Hi. We are using version 0-10. We are also using the same group id for both 
streams.

> Kafka OffsetOutOfRangeException on DStreams union from separate Kafka 
> clusters with identical topic names.
> --
>
> Key: SPARK-17853
> URL: https://issues.apache.org/jira/browse/SPARK-17853
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Marcin Kuthan
>
> During migration from Spark 1.6 to 2.0 I observed OffsetOutOfRangeException  
> reported by Kafka client. In our scenario we create single DStream as a union 
> of multiple DStreams. One DStream for one Kafka cluster (multi dc solution). 
> Both Kafka clusters have the same topics and number of partitions.
> After quick investigation, I found that class DirectKafkaInputDStream keeps 
> offset state for topic and partitions, but it is not aware of different Kafka 
> clusters. 
> For every topic, single DStream is created as a union from all configured 
> Kafka clusters.
> {code}
> class KafkaDStreamSource(configs: Iterable[Map[String, String]]) {
> def createSource(ssc: StreamingContext, topic: String): DStream[(String, 
> Array[Byte])] = {
> val streams = configs.map { config =>
>   val kafkaParams = config
>   val kafkaTopics = Set(topic)
>   KafkaUtils.
>   createDirectStream[String, Array[Byte]](
> ssc,
> LocationStrategies.PreferConsistent,
> ConsumerStrategies.Subscribe[String, Array[Byte]](kafkaTopics, 
> kafkaParams)
>   ).map { record =>
> (record.key, record.value)
>   }
> }
> ssc.union(streams.toSeq)
>   }
> }
> {code}
> At the end, offsets from one Kafka cluster overwrite offsets from second one. 
> Fortunately OffsetOutOfRangeException was thrown because offsets in both 
> Kafka clusters are significantly different.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565315#comment-15565315
 ] 

Peng Meng commented on SPARK-17870:
---

https://github.com/apache/spark/pull/1484#issuecomment-51024568
Hi [~mengxr] and [~avulanov] , what do you think about this JIRA. 

> ML/MLLIB: Statistics.chiSqTest(RDD) is wrong 
> -
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method used to compute the ChiSqTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSqSelector ranks features by ChiSqTestResult.statistic 
> (the chi-square value) and selects the features with the largest chi-square values. But 
> the degrees of freedom (df) of the chi-square values returned by 
> Statistics.chiSqTest(RDD) differ from feature to feature, and chi-square values with 
> different df cannot be compared directly to select features.
> Because of this wrong way of computing the chi-square values, the feature selection 
> results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> When using selectKBest: feature 3 will be selected.
> When using selectFpr: features 1 and 2 will be selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters:
> When using selectKBest: feature 1 will be selected.
> When using selectFpr: features 1 and 2 will be selected.
> This result makes sense, because the df of each feature in scikit-learn is the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17871) Dataset joinwith syntax should support specifying the condition in a compile-time safe way

2016-10-11 Thread Jamie Hutton (JIRA)
Jamie Hutton created SPARK-17871:


 Summary: Dataset joinwith syntax should support specifying the 
condition in a compile-time safe way
 Key: SPARK-17871
 URL: https://issues.apache.org/jira/browse/SPARK-17871
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.0.1
Reporter: Jamie Hutton


One of the great things about Datasets is the way they enable compile-time type safety. 
However, the joinWith method currently only supports a "condition" specified as a Column, 
meaning we have to reference the columns by name, which removes compile-time checking and 
leads to errors.

It would be great if Spark could support a join method that allows compile-time checking 
of the condition. Something like:

leftDS.joinWith(rightDS, (l, r) => l.id == r.id)

This would have the added benefit of solving a serialization issue that stops joinWith 
from working with Kryo (because with Kryo we have just one binary column called "value" 
representing the entire object, and we need to be able to join on items within the 
object - more info here: 
http://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-a-dataset).
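
For reference, a minimal sketch of the status quo this ticket wants to improve on: today the condition passed to joinWith must be an untyped Column, so the join key is named by string and a typo only fails at analysis time, not at compile time. The lambda form shown above is the proposal, not an existing API; the class and object names below are illustrative.

{code}
import org.apache.spark.sql.{Dataset, SparkSession}

// Top-level case classes so implicit Encoders can be derived for them.
case class LRow(id: Long, a: String)
case class RRow(id: Long, b: String)

object JoinWithToday {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("joinwith-demo").getOrCreate()
    import spark.implicits._

    val leftDS: Dataset[LRow]  = Seq(LRow(1L, "x"), LRow(2L, "y")).toDS()
    val rightDS: Dataset[RRow] = Seq(RRow(1L, "p")).toDS()

    // Current API: the condition is a Column referenced by (stringly-typed) name,
    // so misspelling "id" is caught only by the analyzer at runtime.
    val joined: Dataset[(LRow, RRow)] =
      leftDS.joinWith(rightDS, leftDS("id") === rightDS("id"), "inner")

    joined.show()
    spark.stop()
  }
}
{code}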
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17822:


Assignee: (was: Apache Spark)

> JVMObjectTracker.objMap may leak JVM objects
> 
>
> Key: SPARK-17822
> URL: https://issues.apache.org/jira/browse/SPARK-17822
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yin Huai
>
> Seems it is pretty easy to remove objects from JVMObjectTracker.objMap. So, 
> seems it makes sense to use weak reference (like persistentRdds in 
> SparkContext). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565374#comment-15565374
 ] 

Apache Spark commented on SPARK-17822:
--

User 'techaddict' has created a pull request for this issue:
https://github.com/apache/spark/pull/15433

> JVMObjectTracker.objMap may leak JVM objects
> 
>
> Key: SPARK-17822
> URL: https://issues.apache.org/jira/browse/SPARK-17822
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yin Huai
>
> Seems it is pretty easy to remove objects from JVMObjectTracker.objMap. So, 
> seems it makes sense to use weak reference (like persistentRdds in 
> SparkContext). 
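
A minimal sketch of the weak-reference idea mentioned in the description, under the assumption that a weak-valued id-to-object map is what is meant; this is illustrative only and not necessarily what the linked pull request does.

{code}
import java.lang.ref.WeakReference
import java.util.concurrent.atomic.AtomicLong
import scala.collection.concurrent.TrieMap

// Illustrative only; the class name and structure are assumptions, not the SparkR tracker.
class WeakObjectTracker {
  private val objMap = TrieMap.empty[String, WeakReference[Object]]
  private val counter = new AtomicLong(0L)

  def put(obj: Object): String = {
    val id = counter.incrementAndGet().toString
    objMap.put(id, new WeakReference[Object](obj))
    id
  }

  // None if the id is unknown or the referent has already been garbage-collected.
  def get(id: String): Option[Object] =
    objMap.get(id).flatMap(ref => Option(ref.get()))

  def remove(id: String): Unit = objMap.remove(id)

  // Weak values only stop the map from pinning objects; cleared entries still need
  // occasional sweeping, which a real implementation would also have to handle.
  def sweep(): Unit =
    objMap.foreach { case (id, ref) => if (ref.get() == null) objMap.remove(id) }
}
{code}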



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17822:


Assignee: Apache Spark

> JVMObjectTracker.objMap may leak JVM objects
> 
>
> Key: SPARK-17822
> URL: https://issues.apache.org/jira/browse/SPARK-17822
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> Seems it is pretty easy to remove objects from JVMObjectTracker.objMap. So, 
> seems it makes sense to use weak reference (like persistentRdds in 
> SparkContext). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17853) Kafka OffsetOutOfRangeException on DStreams union from separate Kafka clusters with identical topic names.

2016-10-11 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565400#comment-15565400
 ] 

Cody Koeninger commented on SPARK-17853:


Use a different group id.

Let me know if that addresses the issue.
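
A hedged sketch of that workaround, mirroring the shape of the reporter's snippet quoted below: derive a distinct group.id per configured cluster so the two direct streams do not share consumer state. The class name and the group.id naming scheme are illustrative assumptions.

{code}
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

class PerClusterGroupIdSource(configs: Iterable[Map[String, Object]]) {
  def createSource(ssc: StreamingContext, topic: String): DStream[(String, Array[Byte])] = {
    val streams = configs.zipWithIndex.map { case (config, i) =>
      // One consumer group per Kafka cluster, instead of a single shared group.id.
      val kafkaParams = config + ("group.id" -> s"${config.getOrElse("group.id", "app")}-cluster-$i")
      KafkaUtils.createDirectStream[String, Array[Byte]](
        ssc,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, Array[Byte]](Set(topic), kafkaParams)
      ).map(record => (record.key, record.value))
    }
    ssc.union(streams.toSeq)
  }
}
{code}

The essential point is only that the streams for the two clusters must not share offset/consumer state under a single id; how the distinct ids are produced is up to the application.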




> Kafka OffsetOutOfRangeException on DStreams union from separate Kafka 
> clusters with identical topic names.
> --
>
> Key: SPARK-17853
> URL: https://issues.apache.org/jira/browse/SPARK-17853
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Marcin Kuthan
>
> During migration from Spark 1.6 to 2.0 I observed OffsetOutOfRangeException  
> reported by Kafka client. In our scenario we create single DStream as a union 
> of multiple DStreams. One DStream for one Kafka cluster (multi dc solution). 
> Both Kafka clusters have the same topics and number of partitions.
> After quick investigation, I found that class DirectKafkaInputDStream keeps 
> offset state for topic and partitions, but it is not aware of different Kafka 
> clusters. 
> For every topic, single DStream is created as a union from all configured 
> Kafka clusters.
> {code}
> class KafkaDStreamSource(configs: Iterable[Map[String, String]]) {
> def createSource(ssc: StreamingContext, topic: String): DStream[(String, 
> Array[Byte])] = {
> val streams = configs.map { config =>
>   val kafkaParams = config
>   val kafkaTopics = Set(topic)
>   KafkaUtils.
>   createDirectStream[String, Array[Byte]](
> ssc,
> LocationStrategies.PreferConsistent,
> ConsumerStrategies.Subscribe[String, Array[Byte]](kafkaTopics, 
> kafkaParams)
>   ).map { record =>
> (record.key, record.value)
>   }
> }
> ssc.union(streams.toSeq)
>   }
> }
> {code}
> At the end, offsets from one Kafka cluster overwrite offsets from second one. 
> Fortunately OffsetOutOfRangeException was thrown because offsets in both 
> Kafka clusters are significantly different.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17853) Kafka OffsetOutOfRangeException on DStreams union from separate Kafka clusters with identical topic names.

2016-10-11 Thread Aleksander Ihnatowicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565448#comment-15565448
 ] 

Aleksander Ihnatowicz commented on SPARK-17853:
---

Setting different group ids solved the issue.

> Kafka OffsetOutOfRangeException on DStreams union from separate Kafka 
> clusters with identical topic names.
> --
>
> Key: SPARK-17853
> URL: https://issues.apache.org/jira/browse/SPARK-17853
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Marcin Kuthan
>
> During migration from Spark 1.6 to 2.0 I observed OffsetOutOfRangeException  
> reported by Kafka client. In our scenario we create single DStream as a union 
> of multiple DStreams. One DStream for one Kafka cluster (multi dc solution). 
> Both Kafka clusters have the same topics and number of partitions.
> After quick investigation, I found that class DirectKafkaInputDStream keeps 
> offset state for topic and partitions, but it is not aware of different Kafka 
> clusters. 
> For every topic, single DStream is created as a union from all configured 
> Kafka clusters.
> {code}
> class KafkaDStreamSource(configs: Iterable[Map[String, String]]) {
> def createSource(ssc: StreamingContext, topic: String): DStream[(String, 
> Array[Byte])] = {
> val streams = configs.map { config =>
>   val kafkaParams = config
>   val kafkaTopics = Set(topic)
>   KafkaUtils.
>   createDirectStream[String, Array[Byte]](
> ssc,
> LocationStrategies.PreferConsistent,
> ConsumerStrategies.Subscribe[String, Array[Byte]](kafkaTopics, 
> kafkaParams)
>   ).map { record =>
> (record.key, record.value)
>   }
> }
> ssc.union(streams.toSeq)
>   }
> }
> {code}
> At the end, offsets from one Kafka cluster overwrite offsets from second one. 
> Fortunately OffsetOutOfRangeException was thrown because offsets in both 
> Kafka clusters are significantly different.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17808) BinaryType fails in Python 3 due to outdated Pyrolite

2016-10-11 Thread Pete Fein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565470#comment-15565470
 ] 

Pete Fein commented on SPARK-17808:
---

Any reason this can't be included in the next 2.0.x bug fix release? 

> BinaryType fails in Python 3 due to outdated Pyrolite
> -
>
> Key: SPARK-17808
> URL: https://issues.apache.org/jira/browse/SPARK-17808
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
> Environment: spark-2.0.1-bin-hadoop2.7 with Python 3.4.3 on Ubuntu 
> 14.04.4 LTS
>Reporter: Pete Fein
>Assignee: Bryan Cutler
> Fix For: 2.1.0
>
> Attachments: demo.py, demo_output.txt
>
>
> Attempting to create a DataFrame using a BinaryType field fails under Python 
> 3 because the underlying Pyrolite library is out of date. Spark appears to be 
> using Pyrolite 4.9; this issue was fixed in Pyrolite 4.12. See [original bug 
> report|https://github.com/irmen/Pyrolite/issues/36] and 
> [patch|https://github.com/irmen/Pyrolite/commit/eec11786746d933b9d2c3eaeb1e1486319ae436e]
> Test case & output attached. I'm just a Python guy, not really sure how to 
> build Spark / do classpath magic to test if this works correctly with updated 
> Pyrolite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565493#comment-15565493
 ] 

Harish edited comment on SPARK-17463 at 10/11/16 2:03 PM:
--

It looks like a show stopper for my current project. Can you please let me know the 
2.1.0 release date, or whether the same issue exists in 1.6.0/1/2, so that I can revert 
back to 1.6?


was (Author: harishk15):
It looks like a show stopper for my current project. Can you please let me know 
the 2.1.0 release date or do we have same issue in 1.6.0/1/2 ?
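
For context on the stack trace quoted below: it boils down to java.util.ArrayList.writeObject detecting a concurrent modification while the heartbeat thread serializes accumulator updates. A minimal JVM-only sketch (no Spark, purely illustrative, and timing-dependent, so it may not fail on every run) of the same failure mode:

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.{ArrayList => JArrayList}

object ConcurrentSerializationDemo {
  def main(args: Array[String]): Unit = {
    val list = new JArrayList[Integer]()
    (1 to 100000).foreach(i => list.add(i))

    // Thread 1: serialize the list (ArrayList.writeObject checks modCount at the end).
    val writer = new Thread(new Runnable {
      override def run(): Unit = {
        try {
          val oos = new ObjectOutputStream(new ByteArrayOutputStream())
          oos.writeObject(list)
        } catch {
          case e: java.util.ConcurrentModificationException => println(s"Reproduced: $e")
        }
      }
    })
    // Thread 2: keep mutating the same list while it is being written.
    val mutator = new Thread(new Runnable {
      override def run(): Unit = (1 to 1000000).foreach(i => list.add(i))
    })

    writer.start(); mutator.start()
    writer.join(); mutator.join()
  }
}
{code}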

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.i

[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565493#comment-15565493
 ] 

Harish commented on SPARK-17463:


It looks like a show stopper for my current project. Can you please let me know the 
2.1.0 release date, or whether the same issue exists in 1.6.0/1/2?

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.Objec

[jira] [Updated] (SPARK-17872) aggregate function on dataset with tuples grouped by non sequential fields

2016-10-11 Thread Niek Bartholomeus (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niek Bartholomeus updated SPARK-17872:
--
Description: 
The following code, where the index of a tuple field used in the aggregate function is 
lower than a field index used in the groupByKey clause, fails:
{code}
val testDS = Seq((1, 1, 1, 1)).toDS

// group by field 1 and 3, aggregate on field 2 and 4:
testDS
.groupByKey { case (level1, level1FigureA, level2, level2FigureB) => 
(level1, level2) }
.agg((sum($"_2" * $"_4")).as[Double])
.collect
{code}

Error message:
{code}
org.apache.spark.sql.AnalysisException: Reference '_2' is ambiguous, could be: 
_2#562, _2#569.;
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:148)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5$$anonfun$31.apply(Analyzer.scala:604)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5$$anonfun$31.apply(Analyzer.scala:604)
  at 
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:604)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:600)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
{code}

While the following code - where the aggregate field indices are all higher 
than the groupby field indices - works fine:
{code}
testDS
.map { case (level1, level1FigureA, level2, level2FigureB) => (level1, 
level2, level1FigureA, level2FigureB) }
.groupByKey { case  (level1, level2, level1FigureA, level2FigureB) => 
(level1, level2) }
.agg((sum($"_3" * $"_4")).as[Double])
.collect
{code}

  was:
The following lines where the field index in the tuple used in an aggregate 
function is lower than a field index used in the group by clause fails:
{code}
val testDS = Seq((1, 1, 1, 1)).toDS

// group by field one and three, aggregate on field 2:
testDS
.groupByKey { case (level1, level1FigureA, level2, level2FigureB) => 
(level1, level2) }
.agg((sum($"_2" * $"_4")).as[Double])
.collect
{code}

Error message:
{code}
org.apache.spark.sql.AnalysisException: Reference '_2' is ambiguous, could be: 
_2#562, _2#569.;
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:148)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5$$anonfun$31.apply(Analyzer.scala:604)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5$$anonfun$31.apply(Analyzer.scala:604)
  at 
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:604)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:600)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
{code}

While the following code - where the aggregate field indices are all higher 
than the groupby field indices - works fine:
{code}
testDS
.map { case (level1, level1FigureA, level2, level2FigureB) => (level1, 
level2, level1FigureA, level2FigureB) }
.groupByKey { case  (level1, level2, level1FigureA, level2FigureB) => 
(level1, level2) }
.agg((sum($"_3" * $"_4")).as[Double])
.collect
{code}


> aggregate function on dataset with tuples grouped by non sequential fields
> --
>
> Key: SPARK-17872
> URL: https://issues.apache.org/jira/browse/SPARK-17872
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Niek Bartholomeus
>
> The following lines where the field index in the tuple used in an aggregate 
> function is lower than a field index used in the group by

[jira] [Created] (SPARK-17872) aggregate function on dataset with tuples grouped by non sequential fields

2016-10-11 Thread Niek Bartholomeus (JIRA)
Niek Bartholomeus created SPARK-17872:
-

 Summary: aggregate function on dataset with tuples grouped by non 
sequential fields
 Key: SPARK-17872
 URL: https://issues.apache.org/jira/browse/SPARK-17872
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.1
Reporter: Niek Bartholomeus


The following code, where the index of a tuple field used in the aggregate function is 
lower than a field index used in the groupByKey clause, fails:
{code}
val testDS = Seq((1, 1, 1, 1)).toDS

// group by field one and three, aggregate on field 2:
testDS
.groupByKey { case (level1, level1FigureA, level2, level2FigureB) => 
(level1, level2) }
.agg((sum($"_2" * $"_4")).as[Double])
.collect
{code}

Error message:
{code}
org.apache.spark.sql.AnalysisException: Reference '_2' is ambiguous, could be: 
_2#562, _2#569.;
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:148)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5$$anonfun$31.apply(Analyzer.scala:604)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5$$anonfun$31.apply(Analyzer.scala:604)
  at 
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:604)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:600)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
{code}

While the following code - where the aggregate field indices are all higher 
than the groupby field indices - works fine:
{code}
testDS
.map { case (level1, level1FigureA, level2, level2FigureB) => (level1, 
level2, level1FigureA, level2FigureB) }
.groupByKey { case  (level1, level2, level1FigureA, level2FigureB) => 
(level1, level2) }
.agg((sum($"_3" * $"_4")).as[Double])
.collect
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17853) Kafka OffsetOutOfRangeException on DStreams union from separate Kafka clusters with identical topic names.

2016-10-11 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565559#comment-15565559
 ] 

Cody Koeninger commented on SPARK-17853:


Good, will keep this ticket open at least until documentation is made
clearer.




> Kafka OffsetOutOfRangeException on DStreams union from separate Kafka 
> clusters with identical topic names.
> --
>
> Key: SPARK-17853
> URL: https://issues.apache.org/jira/browse/SPARK-17853
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Marcin Kuthan
>
> During migration from Spark 1.6 to 2.0 I observed OffsetOutOfRangeException  
> reported by Kafka client. In our scenario we create single DStream as a union 
> of multiple DStreams. One DStream for one Kafka cluster (multi dc solution). 
> Both Kafka clusters have the same topics and number of partitions.
> After quick investigation, I found that class DirectKafkaInputDStream keeps 
> offset state for topic and partitions, but it is not aware of different Kafka 
> clusters. 
> For every topic, single DStream is created as a union from all configured 
> Kafka clusters.
> {code}
> class KafkaDStreamSource(configs: Iterable[Map[String, String]]) {
> def createSource(ssc: StreamingContext, topic: String): DStream[(String, 
> Array[Byte])] = {
> val streams = configs.map { config =>
>   val kafkaParams = config
>   val kafkaTopics = Set(topic)
>   KafkaUtils.
>   createDirectStream[String, Array[Byte]](
> ssc,
> LocationStrategies.PreferConsistent,
> ConsumerStrategies.Subscribe[String, Array[Byte]](kafkaTopics, 
> kafkaParams)
>   ).map { record =>
> (record.key, record.value)
>   }
> }
> ssc.union(streams.toSeq)
>   }
> }
> {code}
> At the end, offsets from one Kafka cluster overwrite offsets from second one. 
> Fortunately OffsetOutOfRangeException was thrown because offsets in both 
> Kafka clusters are significantly different.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17873) ALTER TABLE ... RENAME TO ... should allow users to specify database in destination table name

2016-10-11 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-17873:
---

 Summary: ALTER TABLE ... RENAME TO ... should allow users to 
specify database in destination table name
 Key: SPARK-17873
 URL: https://issues.apache.org/jira/browse/SPARK-17873
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan
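
The ticket has no description, so the following is an assumption about the intended behavior rather than a statement of what the linked change does: a minimal sketch of renaming a table while spelling out the database qualifier on the destination name.

{code}
import org.apache.spark.sql.SparkSession

object RenameWithDatabaseQualifier {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rename-demo").getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS db1")
    spark.sql("CREATE TABLE IF NOT EXISTS db1.src_table (id INT) USING parquet")

    // Destination name carries an explicit database qualifier; before the change tracked
    // here this form may be rejected (an assumption, since the ticket has no description).
    spark.sql("ALTER TABLE db1.src_table RENAME TO db1.dst_table")

    spark.sql("SHOW TABLES IN db1").show()
    spark.stop()
  }
}
{code}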






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-11 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565592#comment-15565592
 ] 

Don Drake commented on SPARK-16845:
---

Unfortunately, it does not work around it.


16/10/10 18:19:47 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of method 
"(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificUnsafeProjection(references);
/* 003 */ }

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns); when I try fitting the training data on all 
> columns, the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17873) ALTER TABLE ... RENAME TO ... should allow users to specify database in destination table name

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565623#comment-15565623
 ] 

Apache Spark commented on SPARK-17873:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/15434

> ALTER TABLE ... RENAME TO ... should allow users to specify database in 
> destination table name
> --
>
> Key: SPARK-17873
> URL: https://issues.apache.org/jira/browse/SPARK-17873
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17873) ALTER TABLE ... RENAME TO ... should allow users to specify database in destination table name

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17873:


Assignee: Apache Spark  (was: Wenchen Fan)

> ALTER TABLE ... RENAME TO ... should allow users to specify database in 
> destination table name
> --
>
> Key: SPARK-17873
> URL: https://issues.apache.org/jira/browse/SPARK-17873
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Peng Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Meng updated SPARK-17870:
--
Summary: ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is 
wrong   (was: ML/MLLIB: Statistics.chiSqTest(RDD) is wrong )

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method used to compute the ChiSqTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSqSelector ranks features by ChiSqTestResult.statistic 
> (the chi-square value) and selects the features with the largest chi-square values. But 
> the degrees of freedom (df) of the chi-square values returned by 
> Statistics.chiSqTest(RDD) differ from feature to feature, and chi-square values with 
> different df cannot be compared directly to select features.
> Because of this wrong way of computing the chi-square values, the feature selection 
> results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> When using selectKBest: feature 3 will be selected.
> When using selectFpr: features 1 and 2 will be selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters:
> When using selectKBest: feature 1 will be selected.
> When using selectFpr: features 1 and 2 will be selected.
> This result makes sense, because the df of each feature in scikit-learn is the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8425) Add blacklist mechanism for task scheduling

2016-10-11 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-8425:

Attachment: DesignDocforBlacklistMechanism.pdf

Seems like there is agreement on the design, so I'm attaching a snapshot of the 
design doc.  (Original google doc here: 
https://docs.google.com/document/d/1R2CVKctUZG9xwD67jkRdhBR4sCgccPR2dhTYSRXFEmg/edit)

> Add blacklist mechanism for task scheduling
> ---
>
> Key: SPARK-8425
> URL: https://issues.apache.org/jira/browse/SPARK-8425
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Reporter: Saisai Shao
>Assignee: Imran Rashid
>Priority: Minor
> Attachments: DesignDocforBlacklistMechanism.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2016-10-11 Thread Artur Sukhenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artur Sukhenko updated SPARK-4105:
--
Affects Version/s: 2.0.0

> FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
> shuffle
> -
>
> Key: SPARK-4105
> URL: https://issues.apache.org/jira/browse/SPARK-4105
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1, 1.5.1, 1.6.1, 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Attachments: JavaObjectToSerialize.java, 
> SparkFailedToUncompressGenerator.scala
>
>
> We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
> shuffle read.  Here's a sample stacktrace from an executor:
> {code}
> 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
> 33053)
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
>   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>   at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Here's another occurrence of a similar error:
> {code}
> java.io.

[jira] [Assigned] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17139:


Assignee: Apache Spark

> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>
> Add model summary to multinomial logistic regression using same interface as 
> in other ML models.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17139:


Assignee: (was: Apache Spark)

> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Add model summary to multinomial logistic regression using same interface as 
> in other ML models.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565973#comment-15565973
 ] 

Apache Spark commented on SPARK-17139:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/15435

> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Add model summary to multinomial logistic regression using same interface as 
> in other ML models.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2016-10-11 Thread Jo Desmet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566019#comment-15566019
 ] 

Jo Desmet commented on SPARK-15343:
---

By design we apparently have a very tight coupling of the scheduling and 
execution environments. Unless this gets better decoupling through a 
remote-call API, the classpath is as much YARN's as it is Spark's. Isn't that 
what Jersey is about? What I advocate is a more subdued approach where we keep 
new versions of Spark (2.0 and beyond) compatible with the versions of Hadoop 
YARN that are in mainstream use. We should do so until YARN environments with 
the more recent libraries are more widely adopted, or until other fixes or 
features land on either the Spark or the YARN side. The desire for a newer 
Jersey library just does not seem worth that much to me compared to this.
I am no YARN fan, but it feels like we are breaking the bond with YARN just 
because we feel it is not moving fast enough on that side.


> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting following error
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
> at org.apache.spark.SparkContext.(SparkContext.scala:502)
> at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.ref

[jira] [Updated] (SPARK-17858) Provide option for Spark SQL to skip corrupt files

2016-10-11 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-17858:
-
Description: 
In Spark 2.0, corrupt files will fail a SQL query. However, the user may just 
want to skip corrupt files and still run the query.

Another painful thing is that the current exception doesn't contain the paths 
of the corrupt files, which makes it hard for the user to fix them. It's better 
to include the paths in the error message.

Note: In Spark 1.6, Spark SQL always skips corrupt files because of SPARK-17850.

  was:
In Spark 2.0, corrupt files will fail a SQL query. However, the user may just 
want to skip corrupt files and still run the query.

Another painful thing is the current exception doesn't contain the paths of 
corrupt files, makes the user hard to fix their files.

Note: In Spark 1.6, Spark SQL always skip corrupt files because of SPARK-17850.


> Provide option for Spark SQL to skip corrupt files
> --
>
> Key: SPARK-17858
> URL: https://issues.apache.org/jira/browse/SPARK-17858
> Project: Spark
>  Issue Type: Improvement
>Reporter: Shixiong Zhu
>
> In Spark 2.0, corrupt files will fail a SQL query. However, the user may just 
> want to skip corrupt files and still run the query.
> Another painful thing is that the current exception doesn't contain the paths 
> of the corrupt files, which makes it hard for the user to fix them. It's 
> better to include the paths in the error message.
> Note: In Spark 1.6, Spark SQL always skips corrupt files because of 
> SPARK-17850.
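
A rough sketch of how such an option could be used from the shell, assuming it 
lands as a SQL conf; the key name {{spark.sql.files.ignoreCorruptFiles}} and 
the path below are illustrative, not an existing released configuration:

{code}
// Sketch only: the option name is assumed, and /data/events is a placeholder.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

// With the option enabled, corrupt files would be skipped instead of failing
// the whole query.
val df = spark.read.parquet("/data/events")
df.count()
{code}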



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17874) Additional SSL port on HistoryServer should be configurable

2016-10-11 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-17874:
---
Summary: Additional SSL port on HistoryServer should be configurable  (was: 
Enabling SSL on HistoryServer should only open one port not two)

> Additional SSL port on HistoryServer should be configurable
> ---
>
> Key: SPARK-17874
> URL: https://issues.apache.org/jira/browse/SPARK-17874
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.1
>Reporter: Andrew Ash
>
> When turning on SSL on the HistoryServer with 
> {{spark.ssl.historyServer.enabled=true}} this opens up a second port, at the 
> [hardcoded|https://github.com/apache/spark/blob/v2.0.1/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L262]
>  result of calculating {{spark.history.ui.port + 400}}, and sets up a 
> redirect from the original (http) port to the new (https) port.
> {noformat}
> $ netstat -nlp | grep 23714
> (Not all processes could be identified, non-owned process info
>  will not be shown, you would have to be root to see it all.)
> tcp0  0 :::18080:::*
> LISTEN  23714/java
> tcp0  0 :::18480:::*
> LISTEN  23714/java
> {noformat}
> By enabling {{spark.ssl.historyServer.enabled}} I would have expected the one 
> open port to change protocol from http to https, not to have 1) additional 
> ports open 2) the http port remain open 3) the additional port at a value I 
> didn't specify.
> To fix this, we could take one of two approaches:
> Approach 1:
> - one port always, which is configured with {{spark.history.ui.port}}
> - the protocol on that port is http by default
> - or if {{spark.ssl.historyServer.enabled=true}} then it's https
> Approach 2:
> - add a new configuration item {{spark.history.ui.sslPort}} which configures 
> the second port that starts up
> In approach 1 we probably need a way to specify to Spark jobs whether the 
> history server has ssl or not, based on SPARK-16988
> That makes me think we should go with approach 2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17874) Enabling SSL on HistoryServer should only open one port not two

2016-10-11 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-17874:
--

 Summary: Enabling SSL on HistoryServer should only open one port 
not two
 Key: SPARK-17874
 URL: https://issues.apache.org/jira/browse/SPARK-17874
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.0.1
Reporter: Andrew Ash


When turning on SSL on the HistoryServer with 
{{spark.ssl.historyServer.enabled=true}} this opens up a second port, at the 
[hardcoded|https://github.com/apache/spark/blob/v2.0.1/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L262]
 result of calculating {{spark.history.ui.port + 400}}, and sets up a redirect 
from the original (http) port to the new (https) port.

{noformat}
$ netstat -nlp | grep 23714
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp0  0 :::18080:::*
LISTEN  23714/java
tcp0  0 :::18480:::*
LISTEN  23714/java
{noformat}

By enabling {{spark.ssl.historyServer.enabled}} I would have expected the one 
open port to change protocol from http to https, not to have 1) additional 
ports open 2) the http port remain open 3) the additional port at a value I 
didn't specify.

To fix this, we could take one of two approaches:

Approach 1:
- one port always, which is configured with {{spark.history.ui.port}}
- the protocol on that port is http by default
- or if {{spark.ssl.historyServer.enabled=true}} then it's https

Approach 2:
- add a new configuration item {{spark.history.ui.sslPort}} which configures 
the second port that starts up

In approach 1 we probably need a way to specify to Spark jobs whether the 
history server has ssl or not, based on SPARK-16988

That makes me think we should go with approach 2.
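
As a rough sketch of approach 2 (the key {{spark.history.ui.sslPort}} is a 
proposed name, not an existing configuration), the port resolution could look 
something like this, keeping today's +400 offset only as a fallback:

{code}
import org.apache.spark.SparkConf

// Hypothetical sketch of approach 2: "spark.history.ui.sslPort" does not exist
// today; the uiPort + 400 fallback mirrors the current hardcoded behaviour.
def historyServerSslPort(conf: SparkConf): Int = {
  val uiPort = conf.getInt("spark.history.ui.port", 18080)
  conf.getOption("spark.history.ui.sslPort").map(_.toInt).getOrElse(uiPort + 400)
}
{code}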



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566104#comment-15566104
 ] 

Sean Owen commented on SPARK-17463:
---

What do you mean? this has been released already in 2.0.1.

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectO

[jira] [Commented] (SPARK-17808) BinaryType fails in Python 3 due to outdated Pyrolite

2016-10-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566107#comment-15566107
 ] 

Sean Owen commented on SPARK-17808:
---

I think it could be OK. It's a bug fix, and while it is a minor version bump to 
a dependency in a maintenance release, it looks like a small change.

> BinaryType fails in Python 3 due to outdated Pyrolite
> -
>
> Key: SPARK-17808
> URL: https://issues.apache.org/jira/browse/SPARK-17808
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
> Environment: spark-2.0.1-bin-hadoop2.7 with Python 3.4.3 on Ubuntu 
> 14.04.4 LTS
>Reporter: Pete Fein
>Assignee: Bryan Cutler
> Fix For: 2.1.0
>
> Attachments: demo.py, demo_output.txt
>
>
> Attempting to create a DataFrame using a BinaryType field fails under Python 
> 3 because the underlying Pyrolite library is out of date. Spark appears to be 
> using Pyrolite 4.9; this issue was fixed in Pyrolite 4.12. See [original bug 
> report|https://github.com/irmen/Pyrolite/issues/36] and 
> [patch|https://github.com/irmen/Pyrolite/commit/eec11786746d933b9d2c3eaeb1e1486319ae436e]
> Test case & output attached. I'm just a Python guy, not really sure how to 
> build Spark / do classpath magic to test if this works correctly with updated 
> Pyrolite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-11 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566114#comment-15566114
 ] 

Ashish Shrowty commented on SPARK-17709:


I assume I would need to modify the Spark code and build the Spark libraries 
locally? I haven't done that before, but I'm willing to try. Are there any 
docs/links you can point me to that show the best way to go about this?

Thanks

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-11 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566124#comment-15566124
 ] 

Xiao Li commented on SPARK-17709:
-

Below is the link: http://spark.apache.org/docs/latest/building-spark.html

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17875) Remove unneeded direct dependence on Netty 3.x

2016-10-11 Thread Sean Owen (JIRA)
Sean Owen created SPARK-17875:
-

 Summary: Remove unneeded direct dependence on Netty 3.x
 Key: SPARK-17875
 URL: https://issues.apache.org/jira/browse/SPARK-17875
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.0.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Trivial


The Spark build declares a dependency on Netty 3.x and 4.x, but only 4.x is 
used. It's best to remove the 3.x dependency (and while we're at it, update a 
few things like license info)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17875) Remove unneeded direct dependence on Netty 3.x

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17875:


Assignee: Sean Owen  (was: Apache Spark)

> Remove unneeded direct dependence on Netty 3.x
> --
>
> Key: SPARK-17875
> URL: https://issues.apache.org/jira/browse/SPARK-17875
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Trivial
>
> The Spark build declares a dependency on Netty 3.x and 4.x, but only 4.x is 
> used. It's best to remove the 3.x dependency (and while we're at it, update a 
> few things like license info)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17875) Remove unneeded direct dependence on Netty 3.x

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566136#comment-15566136
 ] 

Apache Spark commented on SPARK-17875:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15436

> Remove unneeded direct dependence on Netty 3.x
> --
>
> Key: SPARK-17875
> URL: https://issues.apache.org/jira/browse/SPARK-17875
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Trivial
>
> The Spark build declares a dependency on Netty 3.x and 4.x, but only 4.x is 
> used. It's best to remove the 3.x dependency (and while we're at it, update a 
> few things like license info)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17875) Remove unneeded direct dependence on Netty 3.x

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17875:


Assignee: Apache Spark  (was: Sean Owen)

> Remove unneeded direct dependence on Netty 3.x
> --
>
> Key: SPARK-17875
> URL: https://issues.apache.org/jira/browse/SPARK-17875
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Trivial
>
> The Spark build declares a dependency on Netty 3.x and 4.x, but only 4.x is 
> used. It's best to remove the 3.x dependency (and while we're at it, update a 
> few things like license info)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17811) SparkR cannot parallelize data.frame with NA or NULL in Date columns

2016-10-11 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566163#comment-15566163
 ] 

Miao Wang commented on SPARK-17811:
---

:) I was just about to submit a PR and found that you already have a fix. Good 
to learn more about R. Thanks!

> SparkR cannot parallelize data.frame with NA or NULL in Date columns
> 
>
> Key: SPARK-17811
> URL: https://issues.apache.org/jira/browse/SPARK-17811
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> To reproduce: 
> {code}
> df <- data.frame(Date = as.Date(c(rep("2016-01-10", 10), "NA", "NA")), id = 
> 1:12)
> dim(createDataFrame(df))
> {code}
> We don't seem to have this problem with POSIXlt 
> {code}
> df <- data.frame(Date = as.POSIXlt(as.Date(c(rep("2016-01-10", 10), "NA", 
> "NA"))), id = 1:12)
> dim(createDataFrame(df))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-11 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566160#comment-15566160
 ] 

Ashish Shrowty commented on SPARK-17709:


Cool.. thanks. Will do this in next day or two.

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17876) Write StructuredStreaming WAL to a stream instead of materializing all at once

2016-10-11 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-17876:
---

 Summary: Write StructuredStreaming WAL to a stream instead of 
materializing all at once
 Key: SPARK-17876
 URL: https://issues.apache.org/jira/browse/SPARK-17876
 Project: Spark
  Issue Type: Bug
  Components: SQL, Streaming
Affects Versions: 2.0.1, 2.0.0
Reporter: Burak Yavuz


The CompactibleFileStreamLog materializes the whole metadata log in memory as a 
String. This can cause issues when there are lots of files that are being 
committed, especially during a compaction batch. 

You may come across stacktraces that look like:
{code}
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.lang.StringCoding.encode(StringCoding.java:350)
at java.lang.String.getBytes(String.java:941)
at 
org.apache.spark.sql.execution.streaming.FileStreamSinkLog.serialize(FileStreamSinkLog.scala:127)
at 
{code}

The safer way is to write to an output stream so that we don't have to 
materialize a huge string.
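
A minimal sketch of the idea (method and names are illustrative, not the actual 
CompactibleFileStreamLog API): serialize one entry at a time into the output 
stream so the full log is never held in memory as a single String.

{code}
import java.io.OutputStream
import java.nio.charset.StandardCharsets.UTF_8

// Illustrative only: stream each metadata entry straight to the sink instead
// of first concatenating everything into one huge String.
def serializeTo(entries: Iterator[String], out: OutputStream): Unit = {
  entries.foreach { entry =>
    out.write(entry.getBytes(UTF_8))
    out.write('\n')
  }
  out.flush()
}
{code}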



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17876) Write StructuredStreaming WAL to a stream instead of materializing all at once

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17876:


Assignee: Apache Spark

> Write StructuredStreaming WAL to a stream instead of materializing all at once
> --
>
> Key: SPARK-17876
> URL: https://issues.apache.org/jira/browse/SPARK-17876
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> The CompactibleFileStreamLog materializes the whole metadata log in memory as 
> a String. This can cause issues when there are lots of files that are being 
> committed, especially during a compaction batch. 
> You may come across stacktraces that look like:
> {code}
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.lang.StringCoding.encode(StringCoding.java:350)
> at java.lang.String.getBytes(String.java:941)
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSinkLog.serialize(FileStreamSinkLog.scala:127)
> at 
> {code}
> The safer way is to write to an output stream so that we don't have to 
> materialize a huge string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17876) Write StructuredStreaming WAL to a stream instead of materializing all at once

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566181#comment-15566181
 ] 

Apache Spark commented on SPARK-17876:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/15437

> Write StructuredStreaming WAL to a stream instead of materializing all at once
> --
>
> Key: SPARK-17876
> URL: https://issues.apache.org/jira/browse/SPARK-17876
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Burak Yavuz
>
> The CompactibleFileStreamLog materializes the whole metadata log in memory as 
> a String. This can cause issues when there are lots of files that are being 
> committed, especially during a compaction batch. 
> You may come across stacktraces that look like:
> {code}
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.lang.StringCoding.encode(StringCoding.java:350)
> at java.lang.String.getBytes(String.java:941)
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSinkLog.serialize(FileStreamSinkLog.scala:127)
> at 
> {code}
> The safer way is to write to an output stream so that we don't have to 
> materialize a huge string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17876) Write StructuredStreaming WAL to a stream instead of materializing all at once

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17876:


Assignee: (was: Apache Spark)

> Write StructuredStreaming WAL to a stream instead of materializing all at once
> --
>
> Key: SPARK-17876
> URL: https://issues.apache.org/jira/browse/SPARK-17876
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Burak Yavuz
>
> The CompactibleFileStreamLog materializes the whole metadata log in memory as 
> a String. This can cause issues when there are lots of files that are being 
> committed, especially during a compaction batch. 
> You may come across stacktraces that look like:
> {code}
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.lang.StringCoding.encode(StringCoding.java:350)
> at java.lang.String.getBytes(String.java:941)
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSinkLog.serialize(FileStreamSinkLog.scala:127)
> at 
> {code}
> The safer way is to write to an output stream so that we don't have to 
> materialize a huge string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-11 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566191#comment-15566191
 ] 

Michael Armbrust commented on SPARK-17344:
--

I think the fact that CDH is still distributing 0.9 is a pretty convincing 
argument.

I'm also not convinced it's a bad idea to speak the protocol directly.  Our 
use case ends up being significantly simpler than most other consumer 
implementations since we have the opportunity to do global coordination on the 
driver.  As such, we'd really only need to correctly handle two types of 
requests: 
[TopicMetadataRequest|https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-TopicMetadataRequest]
 and 
[FetchRequest|https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchRequest].
 - The variations across versions are minimal for these messages.
 - We could avoid having N different artifacts for N versions of Kafka.
 - We could remove the complexity of caching consumers on executors (though 
still set preferred locations to encourage co-location).
 - We could avoid extra copies of the payload when going from the Kafka library 
into Tungsten.

I agree we shouldn't make this decision lightly, but looking at our past 
experience supporting multiple versions of Hadoop/Hive as transparently as 
possible, I think this could be a big boost for adoption.

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566206#comment-15566206
 ] 

Harish commented on SPARK-17463:


Is this fix part of the https://github.com/apache/spark/pull/15371 pull 
request? I have 2.0.1 on my cluster but I am still getting both errors.

16/10/11 00:53:42 WARN NettyRpcEndpointRef: Error sending message [message = 
Heartbeat(2,[Lscala.Tuple2;@43f45f95,BlockManagerId(2, HOST, 43256))] in 1 
attempts
org.apache.spark.SparkException: Exception thrown in awaitResult
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1877)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.ConcurrentModificationException
at java.util.ArrayList.writeObject(ArrayList.java:766)
at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
at 
java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:14

[jira] [Resolved] (SPARK-17817) PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes

2016-10-11 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-17817.
--
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.1.0

> PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes
> ---
>
> Key: SPARK-17817
> URL: https://issues.apache.org/jira/browse/SPARK-17817
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Mike Dusenberry
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> Calling {{repartition}} on a PySpark RDD to increase the number of partitions 
> results in highly skewed partition sizes, with most having 0 rows.  The 
> {{repartition}} method should evenly spread out the rows across the 
> partitions, and this behavior is correctly seen on the Scala side.
> Please reference the following code for a reproducible example of this issue:
> {code}
> # Python
> num_partitions = 2
> a = sc.parallelize(range(int(1e6)), 2)  # start with 2 even partitions
> # get the length of each partition
> l = a.repartition(num_partitions).glom().map(len).collect()
> min(l), max(l), sum(l)/len(l), len(l)  # skewed!
> // Scala
> val numPartitions = 2
> val a = sc.parallelize(0 until 1e6.toInt, 2)  // start with 2 even partitions
> // get the length of each partition
> val l = a.repartition(numPartitions).glom().map(_.length).collect()
> println((l.min, l.max, l.sum / l.length, l.length))  // even!
> {code}
> The issue here is that highly skewed partitions can result in severe memory 
> pressure in subsequent steps of a processing pipeline, resulting in OOM 
> errors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-11 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566244#comment-15566244
 ] 

Cody Koeninger commented on SPARK-17344:


How long would it take CDH to distribute 0.10 if there was a compelling Spark 
client for it?

How are you going to handle SSL?

You can't avoid the complexity of caching consumers if you still want the 
benefits of prefetching, and doing an SSL handshake for every batch will kill 
performance if they aren't cached.

Also note that this is a pretty prime example of what I'm talking about in my 
dev mailing list discussion on SIPs.  This issue has been brought up, and 
decided against continuing support of 0.8, multiple times.

You guys started making promises about structured streaming for Kafka over half 
a year ago, and still don't have it feature complete.  This is a big potential 
detour for uncertain gain.  The real underlying problem is still how you're 
going to do better than simply wrapping a DStream, and I don't see how this is 
directly relevant.

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15153) SparkR spark.naiveBayes throws error when label is numeric type

2016-10-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15153:
--
Target Version/s: 2.1.0

> SparkR spark.naiveBayes throws error when label is numeric type
> ---
>
> Key: SPARK-15153
> URL: https://issues.apache.org/jira/browse/SPARK-15153
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> When the label of the dataset is of numeric type, SparkR spark.naiveBayes will 
> throw an error. This bug is easy to reproduce:
> {code}
> t <- as.data.frame(Titanic)
> t1 <- t[t$Freq > 0, -5]
> t1$NumericSurvived <- ifelse(t1$Survived == "No", 0, 1)
> t2 <- t1[-4]
> df <- suppressWarnings(createDataFrame(sqlContext, t2))
> m <- spark.naiveBayes(df, NumericSurvived ~ .)
> 16/05/05 03:26:17 ERROR RBackendHandler: fit on 
> org.apache.spark.ml.r.NaiveBayesWrapper failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.ClassCastException: 
> org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to 
> org.apache.spark.ml.attribute.NominalAttribute
>   at 
> org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:66)
>   at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at io.netty.channel.AbstractChannelHandlerContext.invo
> {code}
> In RFormula, the response variable type could be string or numeric. If it's 
> string, RFormula will transform it to a label of DoubleType via StringIndexer 
> and set the corresponding column metadata; otherwise, RFormula will directly 
> use it as the label when training the model (and assumes that it is numbered 
> from 0, ..., maxLabelIndex). 
> When we extract labels in ml.r.NaiveBayesWrapper, we should handle it 
> according to the type of the response variable (string or numeric).
> cc [~mengxr] [~josephkb]
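
A rough sketch of the handling described above (illustrative only, not the 
actual NaiveBayesWrapper change; the helper name and the assumption of a 
DoubleType label column are mine): read label values from the column metadata 
when the response was a string indexed by RFormula, and fall back to the 
distinct numeric values otherwise.

{code}
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.sql.DataFrame

// Illustrative sketch: recover label names according to whether the response
// variable was string (nominal metadata set by RFormula) or numeric.
def extractLabels(df: DataFrame, labelCol: String): Array[String] = {
  Attribute.fromStructField(df.schema(labelCol)) match {
    case nominal: NominalAttribute if nominal.values.isDefined =>
      nominal.values.get                  // string response, labels from metadata
    case _ =>
      // numeric response: assume labels are already 0.0 ... maxLabelIndex
      df.select(labelCol).distinct().collect()
        .map(_.getDouble(0)).sorted.map(_.toString)
  }
}
{code}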



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15153) SparkR spark.naiveBayes throws error when label is numeric type

2016-10-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566251#comment-15566251
 ] 

Joseph K. Bradley commented on SPARK-15153:
---

Note I'm setting the target version for 2.1, not 2.0.x, since the fix requires 
a public API change in the preceding PR.

> SparkR spark.naiveBayes throws error when label is numeric type
> ---
>
> Key: SPARK-15153
> URL: https://issues.apache.org/jira/browse/SPARK-15153
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> When the label of the dataset is of numeric type, SparkR spark.naiveBayes will 
> throw an error. This bug is easy to reproduce:
> {code}
> t <- as.data.frame(Titanic)
> t1 <- t[t$Freq > 0, -5]
> t1$NumericSurvived <- ifelse(t1$Survived == "No", 0, 1)
> t2 <- t1[-4]
> df <- suppressWarnings(createDataFrame(sqlContext, t2))
> m <- spark.naiveBayes(df, NumericSurvived ~ .)
> 16/05/05 03:26:17 ERROR RBackendHandler: fit on 
> org.apache.spark.ml.r.NaiveBayesWrapper failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.ClassCastException: 
> org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to 
> org.apache.spark.ml.attribute.NominalAttribute
>   at 
> org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:66)
>   at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at io.netty.channel.AbstractChannelHandlerContext.invo
> {code}
> In RFormula, the response variable type could be string or numeric. If it's 
> string, RFormula will transform it to a label of DoubleType via StringIndexer 
> and set the corresponding column metadata; otherwise, RFormula will directly 
> use it as the label when training the model (and assumes that it is numbered 
> from 0, ..., maxLabelIndex). 
> When we extract labels in ml.r.NaiveBayesWrapper, we should handle it 
> according to the type of the response variable (string or numeric).
> cc [~mengxr] [~josephkb]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566284#comment-15566284
 ] 

Sean Owen commented on SPARK-17870:
---

OK I get it, they're doing different things really. The scikit version is 
computing the statistic for count-valued features vs categorical label, and the 
Spark version is computing this for categorical features vs categorical labels. 
Although the number of label classes is constant in both cases, the Spark 
computation would depend on the number of feature classes too. Yes, it does 
need to be changed in Spark.
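
A minimal spark-shell sketch with toy data (not the data from the test suite) of why raw 
statistics from Statistics.chiSqTest(RDD) cannot be compared across features with 
different numbers of categories; ranking by p-value instead accounts for the differing 
degrees of freedom:

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics

// Toy data: feature 0 takes four distinct values (df = 3 against a binary label),
// feature 1 is binary (df = 1). Both are treated as categorical.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(0.0, Vectors.dense(2.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(3.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(4.0, 1.0))
))

Statistics.chiSqTest(data).zipWithIndex.foreach { case (r, i) =>
  // Raw statistics are only comparable when the degrees of freedom match;
  // the p-value accounts for df, so it is the safer basis for ranking features.
  println(s"feature $i: statistic=${r.statistic}, df=${r.degreesOfFreedom}, pValue=${r.pValue}")
}
{code}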

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method used to compute the ChiSquareTestResult in 
> mllib/feature/ChiSqSelector.scala (line 233) is wrong.
> The feature selection method ChiSquareSelector is based on 
> ChiSquareTestResult.statistic (the chi-square value) and selects the features 
> with the largest chi-square values. But the degrees of freedom (df) of the 
> chi-square values returned by Statistics.chiSqTest(RDD) differ from feature 
> to feature, and chi-square values with different df are not comparable, so 
> features cannot be ranked by the raw chi-square value.
> Because of this wrong way of computing the chi-square value, the feature 
> selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If selectKBest is used, feature 3 is selected.
> If selectFpr is used, features 1 and 2 are selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters:
> With selectKBest, feature 1 is selected.
> With selectFpr, features 1 and 2 are selected.
> This result makes sense, because the df of each feature in scikit-learn is 
> the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566299#comment-15566299
 ] 

Sean Owen commented on SPARK-17463:
---

No, that change came after, and is part of a different JIRA that addresses 
another part of the same problem. It is not in 2.0.1
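
For context on the stack trace quoted below, here is a minimal standalone sketch (plain 
JVM code, not Spark) of the underlying failure mode: serializing a java.util.ArrayList 
while another thread appends to it may throw ConcurrentModificationException, because 
ArrayList.writeObject iterates the list and checks its modification count.

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.ArrayList

val list = new ArrayList[Integer]()
(0 until 100000).foreach(i => list.add(Integer.valueOf(i)))

// Writer thread: ArrayList.writeObject iterates the list while serializing it.
val writer = new Thread(new Runnable {
  def run(): Unit = {
    val oos = new ObjectOutputStream(new ByteArrayOutputStream())
    oos.writeObject(list)
  }
})

// Mutator thread: keeps appending while serialization is in flight.
val mutator = new Thread(new Runnable {
  def run(): Unit = (0 until 100000).foreach(i => list.add(Integer.valueOf(i)))
})

writer.start(); mutator.start()
writer.join(); mutator.join()  // the writer may die with ConcurrentModificationException
{code}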

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputSt

[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory

2016-10-11 Thread Jerome Scheuring (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566349#comment-15566349
 ] 

Jerome Scheuring commented on SPARK-12216:
--

_Note that I am entirely new to the process of submitting issues on this 
system: if this needs to be a new issue, I would appreciate someone letting me 
know._

A bug very similar to this one is 100% reproducible across multiple machines, 
running both Windows 8.1 and Windows 10.

It occurs

* in Scala, but not Python (have not tried R)
* only when reading CSV files (and not, for example, when reading Parquet files)
* only when running local, not submitted to a cluster

This program will produce the bug (if {{poemData}} is defined per the 
commented-out section, rather than being read from a CSV file, the bug does not 
occur):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SparkBugDemo {
  def main(args: Array[String]): Unit = {

val poemSchema = StructType(
  Seq(
StructField("label",IntegerType), 
StructField("line",StringType)
  )
)

val sparkSession = SparkSession.builder()
  .appName("Spark Bug Demonstration")
  .master("local[*]")
  .getOrCreate()

//val poemData = sparkSession.createDataFrame(Seq(
//  (0, "There's many a strong farmer"),
//  (0, "Who's heart would break in two"),
//  (1, "If he could see the townland"),
//  (1, "That we are riding to;")
//)).toDF("label", "line")

val poemData = sparkSession.read
  .option("quote", value="")
  .schema(poemSchema)
  .csv(args(0))

println(s"Record count: ${poemData.count()}")

  }
}
{code}

Assuming that {{args(0)}} contains the path to a file with comma-separated 
integer/string pairs, as in:

{noformat}
0,There's many a strong farmer
0,Who's heart would break in two
1,If he could see the townland
1,That we are riding to;
{noformat}

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.5.2
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at scala.util.Try$.apply(Try.scala:161)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
> at 
> org.apache.hadoop.util.ShutdownHookManag

[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory

2016-10-11 Thread Jerome Scheuring (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566349#comment-15566349
 ] 

Jerome Scheuring edited comment on SPARK-12216 at 10/11/16 7:34 PM:


_Note that I am entirely new to the process of submitting issues on this 
system: if this needs to be a new issue, I would appreciate someone letting me 
know._

A bug very similar to this one is 100% reproducible across multiple machines, 
running both Windows 8.1 and Windows 10, compiled with Scala 2.11 and running 
under Spark 2.0.1.

It occurs

* in Scala, but not Python (have not tried R)
* only when reading CSV files (and not, for example, when reading Parquet files)
* only when running local, not submitted to a cluster

This program will produce the bug (if {{poemData}} is defined per the 
commented-out section, rather than being read from a CSV file, the bug does not 
occur):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SparkBugDemo {
  def main(args: Array[String]): Unit = {

val poemSchema = StructType(
  Seq(
StructField("label",IntegerType), 
StructField("line",StringType)
  )
)

val sparkSession = SparkSession.builder()
  .appName("Spark Bug Demonstration")
  .master("local[*]")
  .getOrCreate()

//val poemData = sparkSession.createDataFrame(Seq(
//  (0, "There's many a strong farmer"),
//  (0, "Who's heart would break in two"),
//  (1, "If he could see the townland"),
//  (1, "That we are riding to;")
//)).toDF("label", "line")

val poemData = sparkSession.read
  .option("quote", value="")
  .schema(poemSchema)
  .csv(args(0))

println(s"Record count: ${poemData.count()}")

  }
}
{code}

Assuming that {{args(0)}} contains the path to a file with comma-separated 
integer/string pairs, as in:

{noformat}
0,There's many a strong farmer
0,Who's heart would break in two
1,If he could see the townland
1,That we are riding to;
{noformat}


was (Author: jerome.scheuring):
_Note that I am entirely new to the process of submitting issues on this 
system: if this needs to be a new issue, I would appreciate someone letting me 
know._

A bug very similar to this one is 100% reproducible across multiple machines, 
running both Windows 8.1 and Windows 10.

It occurs

* in Scala, but not Python (have not tried R)
* only when reading CSV files (and not, for example, when reading Parquet files)
* only when running local, not submitted to a cluster

This program will produce the bug (if {{poemData}} is defined per the 
commented-out section, rather than being read from a CSV file, the bug does not 
occur):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SparkBugDemo {
  def main(args: Array[String]): Unit = {

val poemSchema = StructType(
  Seq(
StructField("label",IntegerType), 
StructField("line",StringType)
  )
)

val sparkSession = SparkSession.builder()
  .appName("Spark Bug Demonstration")
  .master("local[*]")
  .getOrCreate()

//val poemData = sparkSession.createDataFrame(Seq(
//  (0, "There's many a strong farmer"),
//  (0, "Who's heart would break in two"),
//  (1, "If he could see the townland"),
//  (1, "That we are riding to;")
//)).toDF("label", "line")

val poemData = sparkSession.read
  .option("quote", value="")
  .schema(poemSchema)
  .csv(args(0))

println(s"Record count: ${poemData.count()}")

  }
}
{code}

Assuming that {{args(0)}} contains the path to a file with comma-separated 
integer/string pairs, as in:

{noformat}
0,There's many a strong farmer
0,Who's heart would break in two
1,If he could see the townland
1,That we are riding to;
{noformat}

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.5.2
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Fai

[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Pivovarov updated SPARK-17877:

Description: 
The following code demonstrates an issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", 
"postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents()
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false

  was:
The following code demonstrates an issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", 
"postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents()
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false


> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates an issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", 
> "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint
> val gg = g.connectedComponents()
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed
> // res5: Boolean = false
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)
Alexander Pivovarov created SPARK-17877:
---

 Summary: Can not checkpoint connectedComponents resulting graph
 Key: SPARK-17877
 URL: https://issues.apache.org/jira/browse/SPARK-17877
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 2.0.1, 1.6.2, 1.5.2
Reporter: Alexander Pivovarov
Priority: Minor


The following code demonstrates an issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", 
"postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents()
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Pivovarov updated SPARK-17877:

Description: 
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", 
"postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents()
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false

  was:
The following code demonstrates an issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", 
"postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents()
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false


> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", 
> "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint
> val gg = g.connectedComponents()
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed
> // res5: Boolean = false
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Pivovarov updated SPARK-17877:

Description: 
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
-> "kelly"))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false

  was:
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", 
"postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents()
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false


> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
> -> "kelly"))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint
> val gg = g.connectedComponents
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed
> // res5: Boolean = false
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15153) SparkR spark.naiveBayes throws error when label is numeric type

2016-10-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-15153.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15431
[https://github.com/apache/spark/pull/15431]

> SparkR spark.naiveBayes throws error when label is numeric type
> ---
>
> Key: SPARK-15153
> URL: https://issues.apache.org/jira/browse/SPARK-15153
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.1.0
>
>
> When the label of dataset is numeric type, SparkR spark.naiveBayes will throw 
> error. This bug is easy to reproduce:
> {code}
> t <- as.data.frame(Titanic)
> t1 <- t[t$Freq > 0, -5]
> t1$NumericSurvived <- ifelse(t1$Survived == "No", 0, 1)
> t2 <- t1[-4]
> df <- suppressWarnings(createDataFrame(sqlContext, t2))
> m <- spark.naiveBayes(df, NumericSurvived ~ .)
> 16/05/05 03:26:17 ERROR RBackendHandler: fit on 
> org.apache.spark.ml.r.NaiveBayesWrapper failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.ClassCastException: 
> org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to 
> org.apache.spark.ml.attribute.NominalAttribute
>   at 
> org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:66)
>   at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at io.netty.channel.AbstractChannelHandlerContext.invo
> {code}
> In RFormula, the response variable type could be string or numeric. If it's 
> string, RFormula will transform it to label of DoubleType by StringIndexer 
> and set corresponding column metadata; otherwise, RFormula will directly use 
> it as the label when training the model (and assumes that it was numbered 
> from 0, ..., maxLabelIndex). 
> When we extract labels at ml.r.NaiveBayesWrapper, we should handle it 
> according to the type of the response variable (string or numeric).
> cc [~mengxr] [~josephkb]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Pivovarov updated SPARK-17877:

Description: 
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
-> "kelly"))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed  // res5: Boolean = false
{code}
I think the last line should return true instead of false

  was:
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
-> "kelly"))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false


> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
> -> "kelly"))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint
> val gg = g.connectedComponents
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed  // res5: Boolean = false
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17845) Improve window function frame boundary API in DataFrame

2016-10-11 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566380#comment-15566380
 ] 

Timothy Hunter commented on SPARK-17845:


I like the {{Window.rowsBetween(Long.MinValue, -3)}} syntax, but it is exposing 
a system implementation detail. How about having some static/singleton values 
that define our notion of plus/minus infinity instead of relying on the system 
values?

Here is a suggestion:

{code}
Window.rowsBetween(Window.unboundedBefore, -3)

object Window {
  def unboundedBefore: Long = Int.MinValue.toLong
}
{code}

To get around the different sizes of ints in each language, I suggest we say 
that every value above 2^31 is considered unbounded above. That should be more 
than enough and covers at least Python, Scala, R, and Java.
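
A minimal sketch of that clamping idea (the helper name is hypothetical, not an existing 
API):

{code}
// Hypothetical normalization: callers may pass their own language's extreme value
// (Long.MinValue, -sys.maxsize, ...); any magnitude of at least 2^31 is treated
// as "unbounded" on that side.
def normalizeBoundary(b: Long): Long = {
  val limit = Int.MaxValue.toLong + 1  // 2^31
  if (b <= -limit) Long.MinValue
  else if (b >= limit) Long.MaxValue
  else b
}

normalizeBoundary(-3L)            // -3, an ordinary relative offset
normalizeBoundary(Long.MinValue)  // Long.MinValue, i.e. unbounded preceding
{code}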


> Improve window function frame boundary API in DataFrame
> ---
>
> Key: SPARK-17845
> URL: https://issues.apache.org/jira/browse/SPARK-17845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> ANSI SQL uses the following to specify the frame boundaries for window 
> functions:
> {code}
> ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> In Spark's DataFrame API, we use integer values to indicate relative position:
> - 0 means "CURRENT ROW"
> - -1 means "1 PRECEDING"
> - Long.MinValue means "UNBOUNDED PRECEDING"
> - Long.MaxValue means "UNBOUNDED FOLLOWING"
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetween(Long.MinValue, -3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetween(Long.MinValue, 0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetween(0, Long.MaxValue)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetween(Long.MinValue, Long.MaxValue)
> {code}
> I think using numeric values to indicate relative positions is actually a 
> good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate 
> unbounded ends is pretty confusing:
> 1. The API is not self-evident. There is no way for a new user to figure out 
> how to indicate an unbounded frame by looking at just the API. The user has 
> to read the doc to figure this out.
> 2. It is weird Long.MinValue or Long.MaxValue has some special meaning.
> 3. Different languages have different min/max values, e.g. in Python we use 
> -sys.maxsize and +sys.maxsize.
> To make this API less confusing, we have a few options:
> Option 1. Add the following (additional) methods:
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)  // this one exists already
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetweenUnboundedPrecedingAnd(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetweenUnboundedPrecedingAndCurrentRow()
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetweenCurrentRowAndUnboundedFollowing()
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()
> {code}
> This is obviously very verbose, but is very similar to how these functions 
> are done in SQL, and is perhaps the most obvious to end users, especially if 
> they come from SQL background.
> Option 2. Decouple the specification for frame begin and frame end into two 
> functions. Assume the boundary is unlimited unless specified.
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsFrom(-3).rowsTo(3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsTo(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsToCurrent() or Window.rowsTo(0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsFromCurrent() or Window.rowsFrom(0)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> // no need to specify
> {code}
> If we go with option 2, we should throw exceptions if users specify multiple 
> from's or to's. A variant of option 2 is to require explicit specification 
> of begin/end even in the case of unbounded boundary, e.g.:
> {code}
> Window.rowsFromBeginning().rowsTo(-3)
> or
> Window.rowsFromUnboundedPreceding().rowsTo(-3)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566385#comment-15566385
 ] 

Alexander Pivovarov commented on SPARK-17877:
-

Another open issue with checkpointing is SPARK-14804
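
As a side note for anyone reproducing this, a small diagnostic sketch (continuing the 
snippet from the description): VertexRDD and EdgeRDD are ordinary RDDs and expose their 
own isCheckpointed flag, which can be compared against the Graph-level one.

{code}
// Compare the Graph-level flag with the flags of the underlying member RDDs.
println(s"graph:    ${gg.isCheckpointed}")
println(s"vertices: ${gg.vertices.isCheckpointed}")
println(s"edges:    ${gg.edges.isCheckpointed}")
{code}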

> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
> -> "kelly"))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint
> val gg = g.connectedComponents
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed  // res5: Boolean = false
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17812) More granular control of starting offsets

2016-10-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17812:
-
Description: 
Right now you can only run a Streaming Query starting from either the earliest 
or latest offsets available at the moment the query is started.  Sometimes 
this is a lot of data.  It would be nice to be able to do the following:
 - seek to user specified offsets for manually specified topicpartitions

  was:
Right now you can only run a Streaming Query starting from either the earliest 
or latest offsets available at the moment the query is started.  Sometimes 
this is a lot of data.  It would be nice to be able to do the following:
 - seek back {{X}} offsets in the stream from the moment the query starts
 - seek to user specified offsets


> More granular control of starting offsets
> -
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets

2016-10-11 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566392#comment-15566392
 ] 

Michael Armbrust commented on SPARK-17812:
--

For the seeking back {{X}} offsets use case, I was interactively querying a 
stream and I wanted *some* data, but not *all available data*.  I did not have 
specific offsets in mind, and under the assumption that items are getting 
hashed across partitions, X offsets back is a very reasonable proxy for time.  
I agree actual time would be better.  However, since there is disagreement on 
this case, I'd propose we break that out into its own ticket and focus on 
assign here.

I'm not sure I understand the concern with the {{startingOffsets}} option 
naming (which we can still change, though, it would be nice to do so before a 
release happens).  It affects which offsets will be included in the query and 
it only takes effect when the query is first started.  [~ofirm], currently we 
support (1) (though I wouldn't say *all* data as we are limited by retention / 
compaction) and (2).  As you said, we can also support (3), though this must be 
done after the fact by adding a predicate to the stream on the timestamp 
column.  For performance it would be nice to push that down into Kafka, but 
I'd split that optimization into another ticket.

Regarding (4), I like the proposed JSON solution.  It would be nice if this was 
unified with whatever format we decide to use in [SPARK-17829] so that you 
could easily pick up where another query left off.  I'd also suggest we use 
{{-1}} and {{-2}} as special offsets for subscribing to a topicpartition at the 
earliest or latest available offsets at query start time.
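
A hypothetical sketch of what the proposed JSON could look like from the reader side, 
assuming the Kafka source option name discussed in this thread and the special values 
suggested above (none of this is a settled API):

{code}
// Explicit offsets per topicpartition; -1 and -2 are the proposed special values
// (latest and earliest offsets at query start, following the usual Kafka convention).
val offsetsJson = """{"topicA":{"0":23,"1":345},"topicB":{"0":-2,"1":-1}}"""

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "topicA,topicB")
  .option("startingOffsets", offsetsJson)   // proposed option name, per the discussion above
  .load()
{code}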

> More granular control of starting offsets
> -
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek back {{X}} offsets in the stream from the moment the query starts
>  - seek to user specified offsets



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


