[jira] [Assigned] (SPARK-14745) CEP support in Spark Streaming

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14745:


Assignee: (was: Apache Spark)

> CEP support in Spark Streaming
> --
>
> Key: SPARK-14745
> URL: https://issues.apache.org/jira/browse/SPARK-14745
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Mario Briggs
> Attachments: SparkStreamingCEP.pdf
>
>
> Complex Event Processing (CEP) is an often-used feature in streaming applications. 
> Spark Streaming currently does not have a DSL/API for it. This JIRA is about 
> what we can add to Spark Streaming to support CEP out of the box.






[jira] [Assigned] (SPARK-14745) CEP support in Spark Streaming

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14745:


Assignee: Apache Spark

> CEP support in Spark Streaming
> --
>
> Key: SPARK-14745
> URL: https://issues.apache.org/jira/browse/SPARK-14745
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Mario Briggs
>Assignee: Apache Spark
> Attachments: SparkStreamingCEP.pdf
>
>
> Complex Event Processing (CEP) is an often-used feature in streaming applications. 
> Spark Streaming currently does not have a DSL/API for it. This JIRA is about 
> what we can add to Spark Streaming to support CEP out of the box.






[jira] [Commented] (SPARK-14745) CEP support in Spark Streaming

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249289#comment-15249289
 ] 

Apache Spark commented on SPARK-14745:
--

User 'agsachin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12518

> CEP support in Spark Streaming
> --
>
> Key: SPARK-14745
> URL: https://issues.apache.org/jira/browse/SPARK-14745
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Mario Briggs
> Attachments: SparkStreamingCEP.pdf
>
>
> Complex Event Processing (CEP) is an often-used feature in streaming applications. 
> Spark Streaming currently does not have a DSL/API for it. This JIRA is about 
> what we can add to Spark Streaming to support CEP out of the box.






[jira] [Resolved] (SPARK-14639) Add `bround` function in Python/R.

2016-04-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14639.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12509
[https://github.com/apache/spark/pull/12509]

> Add `bround` function in Python/R.
> --
>
> Key: SPARK-14639
> URL: https://issues.apache.org/jira/browse/SPARK-14639
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR
>Reporter: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> This issue aims to expose the Scala `bround` function in the Python/R APIs.
> The `bround` function was implemented in SPARK-14614 by extending the current 
> `round` function.
> We used the following semantics from 
> [Hive|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/RoundUtils.java].
> {code}
> public static double bround(double input, int scale) {
> if (Double.isNaN(input) || Double.isInfinite(input)) {
>   return input;
> }
> return BigDecimal.valueOf(input).setScale(scale, 
> RoundingMode.HALF_EVEN).doubleValue();
> }
> {code}
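
As a reference for reviewers, here is a hedged PySpark usage sketch of the function once exposed; it assumes the Python API lands in pyspark.sql.functions next to `round`, mirroring the Scala function described above.

{code}
# Hedged usage sketch: assumes the Python `bround` is exposed in
# pyspark.sql.functions alongside `round`, as this issue proposes.
from pyspark.sql import SparkSession
from pyspark.sql.functions import bround

spark = SparkSession.builder.appName("bround-sketch").getOrCreate()
df = spark.createDataFrame([(2.5,), (3.5,), (2.6,)], ["x"])

# HALF_EVEN ("banker's") rounding per the Hive semantics above:
# 2.5 -> 2.0, 3.5 -> 4.0, 2.6 -> 3.0
df.select(bround(df["x"], 0).alias("rounded")).show()
{code}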






[jira] [Updated] (SPARK-14639) Add `bround` function in Python/R.

2016-04-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14639:
---
Assignee: Dongjoon Hyun

> Add `bround` function in Python/R.
> --
>
> Key: SPARK-14639
> URL: https://issues.apache.org/jira/browse/SPARK-14639
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> This issue aims to expose the Scala `bround` function in the Python/R APIs.
> The `bround` function was implemented in SPARK-14614 by extending the current 
> `round` function.
> We used the following semantics from 
> [Hive|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/RoundUtils.java].
> {code}
> public static double bround(double input, int scale) {
> if (Double.isNaN(input) || Double.isInfinite(input)) {
>   return input;
> }
> return BigDecimal.valueOf(input).setScale(scale, 
> RoundingMode.HALF_EVEN).doubleValue();
> }
> {code}






[jira] [Comment Edited] (SPARK-14745) CEP support in Spark Streaming

2016-04-19 Thread Mario Briggs (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249261#comment-15249261
 ] 

Mario Briggs edited comment on SPARK-14745 at 4/20/16 5:18 AM:
---

Document covering what CEP is, with examples, features, and a possible API 


was (Author: mariobriggs):
Examples, Features and possible API

> CEP support in Spark Streaming
> --
>
> Key: SPARK-14745
> URL: https://issues.apache.org/jira/browse/SPARK-14745
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Mario Briggs
> Attachments: SparkStreamingCEP.pdf
>
>
> Complex Event Processing (CEP) is an often-used feature in streaming applications. 
> Spark Streaming currently does not have a DSL/API for it. This JIRA is about 
> what we can add to Spark Streaming to support CEP out of the box.






[jira] [Updated] (SPARK-14745) CEP support in Spark Streaming

2016-04-19 Thread Mario Briggs (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mario Briggs updated SPARK-14745:
-
Attachment: SparkStreamingCEP.pdf

Examples, Features and possible API

> CEP support in Spark Streaming
> --
>
> Key: SPARK-14745
> URL: https://issues.apache.org/jira/browse/SPARK-14745
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Mario Briggs
> Attachments: SparkStreamingCEP.pdf
>
>
> Complex Event Processing (CEP) is an often-used feature in streaming applications. 
> Spark Streaming currently does not have a DSL/API for it. This JIRA is about 
> what we can add to Spark Streaming to support CEP out of the box.






[jira] [Created] (SPARK-14745) CEP support in Spark Streaming

2016-04-19 Thread Mario Briggs (JIRA)
Mario Briggs created SPARK-14745:


 Summary: CEP support in Spark Streaming
 Key: SPARK-14745
 URL: https://issues.apache.org/jira/browse/SPARK-14745
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Mario Briggs


Complex Event Processing (CEP) is an often-used feature in streaming applications. 
Spark Streaming currently does not have a DSL/API for it. This JIRA is about 
what we can add to Spark Streaming to support CEP out of the box.
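
For readers unfamiliar with CEP, here is a small, framework-agnostic Python sketch (illustrative only, not a proposed Spark API) of the kind of "event A followed by event B" pattern a CEP DSL would let users declare over an ordered event stream; the event schema and threshold are hypothetical.

{code}
# Illustrative only: the kind of "overheat followed by shutdown" pattern a CEP
# DSL would declare. Plain Python over a list, not a Spark API.
events = [
    {"device": "d1", "type": "temp", "value": 72},
    {"device": "d1", "type": "temp", "value": 95},  # anomaly
    {"device": "d1", "type": "shutdown"},           # followed by shutdown -> match
]

def detect_overheat_then_shutdown(stream, threshold=90):
    overheated = set()
    matches = []
    for e in stream:
        if e["type"] == "temp" and e["value"] > threshold:
            overheated.add(e["device"])
        elif e["type"] == "shutdown" and e["device"] in overheated:
            matches.append(e["device"])
            overheated.discard(e["device"])
    return matches

print(detect_overheat_then_shutdown(events))  # ['d1']
{code}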






[jira] [Resolved] (SPARK-14600) Push predicates through Expand

2016-04-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14600.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12496
[https://github.com/apache/spark/pull/12496]

> Push predicates through Expand
> --
>
> Key: SPARK-14600
> URL: https://issues.apache.org/jira/browse/SPARK-14600
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
> Fix For: 2.0.0
>
>
> A grouping-sets query is analyzed as Aggregate(Expand(Project)). The grouping 
> attributes come from the Project, but they have a different meaning in the Project 
> (equal to the original grouping expression) and in the Expand (either the original 
> grouping expression or null). This does not make sense, because the same attribute 
> produces different results in different operators.
> A better approach could be Aggregate(Expand()), but then we need to fix SQL 
> generation.
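
A minimal PySpark/SQL sketch of the kind of query involved; the table, columns, and filter value are hypothetical, and the point is only that a filter on a grouping column is a candidate to be pushed below the Expand generated for GROUPING SETS.

{code}
# Hypothetical grouping-sets query; table, columns and the literal 'US' are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("grouping-sets-sketch").getOrCreate()
spark.createDataFrame(
    [("US", "web", 10), ("US", "app", 20), ("EU", "web", 30)],
    ["country", "channel", "revenue"],
).createOrReplaceTempView("sales")

# The outer filter references a grouping column, so after this change it is a
# candidate to be pushed below the Expand operator rather than evaluated above it.
spark.sql("""
    SELECT * FROM (
      SELECT country, channel, SUM(revenue) AS total
      FROM sales
      GROUP BY country, channel GROUPING SETS ((country), (country, channel))
    ) t
    WHERE country = 'US'
""").explain(True)
{code}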






[jira] [Updated] (SPARK-14704) create accumulators in TaskMetrics

2016-04-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14704:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-14626

> create accumulators in TaskMetrics
> --
>
> Key: SPARK-14704
> URL: https://issues.apache.org/jira/browse/SPARK-14704
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>







[jira] [Resolved] (SPARK-14704) create accumulators in TaskMetrics

2016-04-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14704.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> create accumulators in TaskMetrics
> --
>
> Key: SPARK-14704
> URL: https://issues.apache.org/jira/browse/SPARK-14704
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>







[jira] [Resolved] (SPARK-13419) SubquerySuite should use checkAnswer rather than ScalaTest's assertResult

2016-04-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13419.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12269
[https://github.com/apache/spark/pull/12269]

> SubquerySuite should use checkAnswer rather than ScalaTest's assertResult
> -
>
> Key: SPARK-13419
> URL: https://issues.apache.org/jira/browse/SPARK-13419
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
> Fix For: 2.0.0
>
>
> This is blocked on being able to generate SQL for subqueries.






[jira] [Created] (SPARK-14744) Put examples packaging on a diet

2016-04-19 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-14744:
--

 Summary: Put examples packaging on a diet
 Key: SPARK-14744
 URL: https://issues.apache.org/jira/browse/SPARK-14744
 Project: Spark
  Issue Type: Improvement
  Components: Examples
Affects Versions: 2.0.0
Reporter: Marcelo Vanzin
Priority: Minor


Currently the examples bring in a lot of external dependencies, ballooning the 
size of the Spark distribution packages.

I'd like to propose two things to slim down these dependencies:

- make all non-Spark, and also Spark Streaming, dependencies "provided". This 
means, especially for streaming connectors, that launching examples becomes 
more like launching real applications (where you need to figure out how to 
provide those dependencies, e.g. using {{--packages}}).

- audit examples and remove those that don't provide a lot of value. For 
example, HBase is working on full-featured Spark bindings, based on code that 
has already been in use for a while before being merged into HBase. The HBase 
example in Spark is very bare bones and, in comparison, not really useful and 
in fact a little misleading.







[jira] [Created] (SPARK-14743) Improve delegation token handling in secure clusters

2016-04-19 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-14743:
--

 Summary: Improve delegation token handling in secure clusters
 Key: SPARK-14743
 URL: https://issues.apache.org/jira/browse/SPARK-14743
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Marcelo Vanzin


In a way, I'd consider this a parent bug of SPARK-7252.

Spark's current support for delegation tokens is a little all over the place:
- for HDFS, there's support for re-creating tokens if a principal and keytab 
are provided
- for HBase and Hive, Spark will fetch delegation tokens so that apps can work 
in cluster mode, but will not re-create them, so apps that need those will stop 
working after 7 days
- for anything else, Spark doesn't do anything. Lots of other services use 
delegation tokens, and supporting them as data sources in Spark becomes more 
complicated because of that. e.g., Kafka will (hopefully) soon support them.

It would be nice if Spark had consistent support for handling delegation tokens 
regardless of who needs them. I'd list these as the requirements:

- Spark to provide a generic interface for fetching delegation tokens. This 
would allow Spark's delegation token support to be extended using some plugin 
architecture (e.g. Java services), meaning Spark itself doesn't need to support 
every possible service out there.

This would be used to fetch tokens when launching apps in cluster mode, and 
when a principal and a keytab are provided to Spark.

- A way to manually update delegation tokens in Spark. For example, a new 
SparkContext API, or some configuration that tells Spark to monitor a file for 
changes and load tokens from said file.

This would allow external applications to manage tokens outside of Spark and be 
able to update a running Spark application (think, for example, a job server 
like Oozie, or something like Hive-on-Spark which manages Spark apps running 
remotely).

- A way to notify running code that new delegation tokens have been loaded.

This may not be strictly necessary; it might be possible for code to detect 
that, e.g., by peeking into the UserGroupInformation structure. But an event 
sent to the listener bus would allow applications to react when new tokens are 
available (e.g., the Hive backend could re-create connections to the metastore 
server using the new tokens).


Also, cc'ing [~busbey] and [~steve_l] since you've talked about this in the 
mailing list recently.






[jira] [Updated] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS

2016-04-19 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-14521:
-
Summary: StackOverflowError in Kryo when executing TPC-DS  (was: 
StackOverflowError in Kryo when executing TPC-DS Query27)

> StackOverflowError in Kryo when executing TPC-DS
> 
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Rajesh Balamohan
>Priority: Blocker
>
> Build details:  Spark build from master branch (Apr-10)
> DataSet: TPC-DS at 200 GB scale in Parquet format stored in Hive.
> Client: $SPARK_HOME/bin/beeline 
> Query:  TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyway)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}






[jira] [Resolved] (SPARK-13905) Change signature of as.data.frame() to be consistent with the R base package

2016-04-19 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-13905.
---
   Resolution: Fixed
 Assignee: Sun Rui
Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/11811

> Change signature of as.data.frame() to be consistent with the R base package
> 
>
> Key: SPARK-13905
> URL: https://issues.apache.org/jira/browse/SPARK-13905
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>Assignee: Sun Rui
> Fix For: 2.0.0
>
>
> Change the signature of as.data.frame() to be consistent with that in the R 
> base package, matching R users' conventions, as documented at 
> http://www.inside-r.org/r-doc/base/as.data.frame






[jira] [Closed] (SPARK-8327) Ganglia failed to start while starting standalone Spark on EC2 with spark-ec2

2016-04-19 Thread Vladimir Vladimirov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Vladimirov closed SPARK-8327.
--
Resolution: Not A Problem

> Ganglia failed to start while starting standalone Spark on EC2 with 
> spark-ec2 
> ---
>
> Key: SPARK-8327
> URL: https://issues.apache.org/jira/browse/SPARK-8327
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.3.1
>Reporter: Vladimir Vladimirov
>Priority: Minor
>
> exception shown
> {code}
> [FAILED] Starting httpd: httpd: Syntax error on line 199 of 
> /etc/httpd/conf/httpd.conf: Cannot load modules/libphp-5.5.so into server: 
> /etc/httpd/modules/libphp-5.5.so: cannot open shared object file: No such 
> file or directory [FAILED] 
> {code}






[jira] [Created] (SPARK-14742) Redirect spark-ec2 doc to new location

2016-04-19 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-14742:


 Summary: Redirect spark-ec2 doc to new location
 Key: SPARK-14742
 URL: https://issues.apache.org/jira/browse/SPARK-14742
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, EC2
Reporter: Nicholas Chammas
Priority: Minor


See: https://github.com/amplab/spark-ec2/pull/24#issuecomment-212033453

We need to redirect this page

http://spark.apache.org/docs/latest/ec2-scripts.html

to this page

https://github.com/amplab/spark-ec2#readme






[jira] [Commented] (SPARK-8327) Ganglia failed to start while starting standalone Spark on EC2 with spark-ec2

2016-04-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249144#comment-15249144
 ] 

Nicholas Chammas commented on SPARK-8327:
-

[~vvladymyrov] - Is this still an issue? If so, I suggest migrating this issue 
to the spark-ec2 tracker: https://github.com/amplab/spark-ec2/issues

> Ganglia failed to start while starting standalone Spark on EC2 with 
> spark-ec2 
> ---
>
> Key: SPARK-8327
> URL: https://issues.apache.org/jira/browse/SPARK-8327
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.3.1
>Reporter: Vladimir Vladimirov
>Priority: Minor
>
> exception shown
> {code}
> [FAILED] Starting httpd: httpd: Syntax error on line 199 of 
> /etc/httpd/conf/httpd.conf: Cannot load modules/libphp-5.5.so into server: 
> /etc/httpd/modules/libphp-5.5.so: cannot open shared object file: No such 
> file or directory [FAILED] 
> {code}






[jira] [Commented] (SPARK-6527) sc.binaryFiles can not access files on s3

2016-04-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249141#comment-15249141
 ] 

Nicholas Chammas commented on SPARK-6527:
-

Did the s3a suggestion work? If not, did anybody file an issue as Steve 
suggested with more detail?

> sc.binaryFiles can not access files on s3
> -
>
> Key: SPARK-6527
> URL: https://issues.apache.org/jira/browse/SPARK-6527
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Input/Output
>Affects Versions: 1.2.0, 1.3.0
> Environment: I am running Spark on EC2
>Reporter: Zhao Zhang
>Priority: Minor
>
> The sc.binaryFiles() method cannot access files stored on S3. It can correctly 
> list the number of files, but reports "file does not exist" when processing 
> them. I also tried sc.textFile(), which works fine.
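
For anyone retesting this, a minimal sketch of the s3a suggestion referenced above; the bucket and path are placeholders, and the s3a connector plus AWS credentials must be configured on the cluster.

{code}
# Retest sketch; bucket and key names are placeholders.
from pyspark import SparkContext

sc = SparkContext(appName="binaryFiles-s3a-check")

# The "s3a suggestion": use the newer s3a:// scheme rather than s3:// or s3n://.
rdd = sc.binaryFiles("s3a://my-bucket/path/to/binary-files/")
print(rdd.count())                             # listing should succeed
print(rdd.map(lambda kv: len(kv[1])).take(3))  # and the payloads should be readable
{code}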






[jira] [Comment Edited] (SPARK-6527) sc.binaryFiles can not access files on s3

2016-04-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249141#comment-15249141
 ] 

Nicholas Chammas edited comment on SPARK-6527 at 4/20/16 2:27 AM:
--

Did the s3a suggestion work? If not, did anybody file an issue as Steve 
suggested?


was (Author: nchammas):
Did the s3a suggestion work? If not, did anybody file an issue as Steve 
suggested with more detail?

> sc.binaryFiles can not access files on s3
> -
>
> Key: SPARK-6527
> URL: https://issues.apache.org/jira/browse/SPARK-6527
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Input/Output
>Affects Versions: 1.2.0, 1.3.0
> Environment: I am running Spark on EC2
>Reporter: Zhao Zhang
>Priority: Minor
>
> The sc.binaryFiles() method cannot access files stored on S3. It can correctly 
> list the number of files, but reports "file does not exist" when processing 
> them. I also tried sc.textFile(), which works fine.






[jira] [Assigned] (SPARK-14741) Streaming from partitioned directory structure captures unintended partition columns

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14741:


Assignee: Tathagata Das  (was: Apache Spark)

> Streaming from partitioned directory structure captures unintended partition 
> columns
> 
>
> Key: SPARK-14741
> URL: https://issues.apache.org/jira/browse/SPARK-14741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Consider the following directory structure
> dir/col=X/some-files
> If we create a text-format streaming dataframe on {dir/col=X/}, it should not 
> treat col as a partitioning column. Even though the streaming dataframe does 
> not do so, the generated batch dataframes pick up col as a partitioning column, 
> causing a mismatch between the streaming source schema and the generated 
> dataframe schema. This leads to a runtime failure: 
> 18:55:11.262 ERROR org.apache.spark.sql.execution.streaming.StreamExecution: 
> Query query-0 terminated with error
> java.lang.AssertionError: assertion failed: Invalid batch: c#2 != c#7,type#8






[jira] [Assigned] (SPARK-14741) Streaming from partitioned directory structure captures unintended partition columns

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14741:


Assignee: Apache Spark  (was: Tathagata Das)

> Streaming from partitioned directory structure captures unintended partition 
> columns
> 
>
> Key: SPARK-14741
> URL: https://issues.apache.org/jira/browse/SPARK-14741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> Consider the following directory structure
> dir/col=X/some-files
> If we create a text-format streaming dataframe on {dir/col=X/}, it should not 
> treat col as a partitioning column. Even though the streaming dataframe does 
> not do so, the generated batch dataframes pick up col as a partitioning column, 
> causing a mismatch between the streaming source schema and the generated 
> dataframe schema. This leads to a runtime failure: 
> 18:55:11.262 ERROR org.apache.spark.sql.execution.streaming.StreamExecution: 
> Query query-0 terminated with error
> java.lang.AssertionError: assertion failed: Invalid batch: c#2 != c#7,type#8






[jira] [Commented] (SPARK-14741) Streaming from partitioned directory structure captures unintended partition columns

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249139#comment-15249139
 ] 

Apache Spark commented on SPARK-14741:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/12517

> Streaming from partitioned directory structure captures unintended partition 
> columns
> 
>
> Key: SPARK-14741
> URL: https://issues.apache.org/jira/browse/SPARK-14741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Consider the following directory structure
> dir/col=X/some-files
> If we create a text-format streaming dataframe on {dir/col=X/}, it should not 
> treat col as a partitioning column. Even though the streaming dataframe does 
> not do so, the generated batch dataframes pick up col as a partitioning column, 
> causing a mismatch between the streaming source schema and the generated 
> dataframe schema. This leads to a runtime failure: 
> 18:55:11.262 ERROR org.apache.spark.sql.execution.streaming.StreamExecution: 
> Query query-0 terminated with error
> java.lang.AssertionError: assertion failed: Invalid batch: c#2 != c#7,type#8






[jira] [Created] (SPARK-14741) Streaming from partitioned directory structure captures unintended partition columns

2016-04-19 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-14741:
-

 Summary: Streaming from partitioned directory structure captures 
unintended partition columns
 Key: SPARK-14741
 URL: https://issues.apache.org/jira/browse/SPARK-14741
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Tathagata Das
Assignee: Tathagata Das


Consider the following directory structure

dir/col=X/some-files

If we create a text-format streaming dataframe on {dir/col=X/}, it should not treat 
col as a partitioning column. Even though the streaming dataframe does not do so, 
the generated batch dataframes pick up col as a partitioning column, causing a 
mismatch between the streaming source schema and the generated dataframe schema. 
This leads to a runtime failure: 


18:55:11.262 ERROR org.apache.spark.sql.execution.streaming.StreamExecution: 
Query query-0 terminated with error
java.lang.AssertionError: assertion failed: Invalid batch: c#2 != c#7,type#8
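
A minimal PySpark sketch of the reported setup; the directory layout mirrors the description, everything else (app name, query name, sink) is a placeholder, and `readStream.text` is assumed to be the structured-streaming entry point this issue concerns.

{code}
# Illustrative reproduction; the layout matches dir/col=X/some-files from the report.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-dir-stream").getOrCreate()

# Streaming dataframe created directly on the partition-style subdirectory.
stream_df = spark.readStream.text("dir/col=X/")
print(stream_df.schema)  # expected: only the text 'value' column, no 'col' column

query = (stream_df.writeStream
         .format("memory")
         .queryName("q")
         .start())
# Per the report, the batch dataframes generated per trigger pick up 'col' as a
# partition column, so their schema no longer matches the streaming source schema
# and the query fails with the assertion error quoted above.
query.awaitTermination(10)
{code}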








[jira] [Commented] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame

2016-04-19 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249109#comment-15249109
 ] 

Sun Rui commented on SPARK-12148:
-

[~felixcheung] Go ahead with this:)

> SparkR: rename DataFrame to SparkDataFrame
> --
>
> Key: SPARK-12148
> URL: https://issues.apache.org/jira/browse/SPARK-12148
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Michael Lawrence
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame". 
> That conflicts with the more general DataFrame class defined in the S4Vectors 
> package. Would it not be more appropriate to use the name "SparkDataFrame" 
> instead?






[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249084#comment-15249084
 ] 

Apache Spark commented on SPARK-14739:
--

User 'arashpa' has created a pull request for this issue:
https://github.com/apache/spark/pull/12516

> Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with 
> no indices
> ---
>
> Key: SPARK-14739
> URL: https://issues.apache.org/jira/browse/SPARK-14739
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>
> DenseVector:
> {code}
> Vectors.parse(str(Vectors.dense([])))
> ## ValueErrorTraceback (most recent call last)
> ## .. 
> ## ValueError: Unable to parse values from
> {code}
> SparseVector:
> {code}
> Vectors.parse(str(Vectors.sparse(5, [], [])))
> ## ValueErrorTraceback (most recent call last)
> ##  ... 
> ## ValueError: Unable to parse indices from .
> {code}
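
For context, a short sketch of the round-trip behaviour implied by the report; the expected outputs are inferred from the constructors above, not from a released fix, so treat the comments as assumptions.

{code}
# Round-trip behaviour implied by the report; run against a build containing the fix.
from pyspark.mllib.linalg import Vectors

dense_empty = Vectors.dense([])
sparse_empty = Vectors.sparse(5, [], [])

print(str(dense_empty))   # e.g. '[]'
print(str(sparse_empty))  # e.g. '(5,[],[])'

# With the fix these should parse back instead of raising ValueError:
assert Vectors.parse(str(dense_empty)) == dense_empty
assert Vectors.parse(str(sparse_empty)) == sparse_empty
{code}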







[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices

2016-04-19 Thread Vishnu Prasad (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249080#comment-15249080
 ] 

Vishnu Prasad commented on SPARK-14739:
---

I've merged your PR with your test fixes. Thank you

> Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with 
> no indices
> ---
>
> Key: SPARK-14739
> URL: https://issues.apache.org/jira/browse/SPARK-14739
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>
> DenseVector:
> {code}
> Vectors.parse(str(Vectors.dense([])))
> ## ValueErrorTraceback (most recent call last)
> ## .. 
> ## ValueError: Unable to parse values from
> {code}
> SparseVector:
> {code}
> Vectors.parse(str(Vectors.sparse(5, [], [])))
> ## ValueErrorTraceback (most recent call last)
> ##  ... 
> ## ValueError: Unable to parse indices from .
> {code}






[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices

2016-04-19 Thread Arash Parsa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249078#comment-15249078
 ] 

Arash Parsa commented on SPARK-14739:
-

Sorry, I wasn't able to pull from your branch. I submitted a new PR with the proper 
updates. Please let me know how it looks.

> Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with 
> no indices
> ---
>
> Key: SPARK-14739
> URL: https://issues.apache.org/jira/browse/SPARK-14739
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>
> DenseVector:
> {code}
> Vectors.parse(str(Vectors.dense([])))
> ## ValueErrorTraceback (most recent call last)
> ## .. 
> ## ValueError: Unable to parse values from
> {code}
> SparseVector:
> {code}
> Vectors.parse(str(Vectors.sparse(5, [], [])))
> ## ValueErrorTraceback (most recent call last)
> ##  ... 
> ## ValueError: Unable to parse indices from .
> {code}






[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249077#comment-15249077
 ] 

Apache Spark commented on SPARK-14739:
--

User 'arashpa' has created a pull request for this issue:
https://github.com/apache/spark/pull/12515

> Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with 
> no indices
> ---
>
> Key: SPARK-14739
> URL: https://issues.apache.org/jira/browse/SPARK-14739
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>
> DenseVector:
> {code}
> Vectors.parse(str(Vectors.dense([])))
> ## ValueErrorTraceback (most recent call last)
> ## .. 
> ## ValueError: Unable to parse values from
> {code}
> SparseVector:
> {code}
> Vectors.parse(str(Vectors.sparse(5, [], [])))
> ## ValueErrorTraceback (most recent call last)
> ##  ... 
> ## ValueError: Unable to parse indices from .
> {code}






[jira] [Assigned] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS Query27

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14521:


Assignee: Apache Spark

> StackOverflowError in Kryo when executing TPC-DS Query27
> 
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Rajesh Balamohan
>Assignee: Apache Spark
>Priority: Blocker
>
> Build details:  Spark build from master branch (Apr-10)
> DataSet: TPC-DS at 200 GB scale in Parquet format stored in Hive.
> Client: $SPARK_HOME/bin/beeline 
> Query:  TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyway)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}






[jira] [Assigned] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS Query27

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14521:


Assignee: (was: Apache Spark)

> StackOverflowError in Kryo when executing TPC-DS Query27
> 
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Rajesh Balamohan
>Priority: Blocker
>
> Build details:  Spark build from master branch (Apr-10)
> DataSet: TPC-DS at 200 GB scale in Parquet format stored in Hive.
> Client: $SPARK_HOME/bin/beeline 
> Query:  TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyway)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}






[jira] [Commented] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS Query27

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249067#comment-15249067
 ] 

Apache Spark commented on SPARK-14521:
--

User 'rajeshbalamohan' has created a pull request for this issue:
https://github.com/apache/spark/pull/12514

> StackOverflowError in Kryo when executing TPC-DS Query27
> 
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Rajesh Balamohan
>Priority: Blocker
>
> Build details:  Spark build from master branch (Apr-10)
> DataSet: TPC-DS at 200 GB scale in Parquet format stored in Hive.
> Client: $SPARK_HOME/bin/beeline 
> Query:  TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyway)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}






[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices

2016-04-19 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249061#comment-15249061
 ] 

Maciej Szymkiewicz commented on SPARK-14739:


I extracted the relevant test fixes and made a PR against your branch. 

> Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with 
> no indices
> ---
>
> Key: SPARK-14739
> URL: https://issues.apache.org/jira/browse/SPARK-14739
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>
> DenseVector:
> {code}
> Vectors.parse(str(Vectors.dense([])))
> ## ValueErrorTraceback (most recent call last)
> ## .. 
> ## ValueError: Unable to parse values from
> {code}
> SparseVector:
> {code}
> Vectors.parse(str(Vectors.sparse(5, [], [])))
> ## ValueErrorTraceback (most recent call last)
> ##  ... 
> ## ValueError: Unable to parse indices from .
> {code}






[jira] [Commented] (SPARK-14571) Log instrumentation in ALS

2016-04-19 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249058#comment-15249058
 ] 

Timothy Hunter commented on SPARK-14571:


Yes, please feel free to take this task. Thanks!




> Log instrumentation in ALS
> --
>
> Key: SPARK-14571
> URL: https://issues.apache.org/jira/browse/SPARK-14571
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Timothy Hunter
>







[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters

2016-04-19 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-14740:
-
Description: 
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

{noformat}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
{noformat}

I can get the regression coefficient out, but I can't get the regularization 
parameter

{noformat}
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
{noformat}

For the original issue raised on StackOverflow please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark



  was:
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

{noformat}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
{noformat}

I can get the regression coefficient out, but I can't get the regularization 
parameter

{noformat}
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
{noformat}

For a simple example please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark




> CrossValidatorModel.bestModel does not include hyper-parameters
> ---
>
> Key: SPARK-14740
> URL: https://issues.apache.org/jira/browse/SPARK-14740
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Paul Shearer
>
> If you tune hyperparameters using a CrossValidator object in PySpark, you may 
> not be able to extract the parameter values of the best model.
> {noformat}
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
> dataset = sqlContext.createDataFrame(
> [(Vectors.dense([0.0]), 0.0),
>  (Vectors.dense([0.4]), 1.0),
>  (Vectors.dense([0.5]), 0.0),
>  (Vectors.dense([0.6]), 1.0),
>  (Vectors.dense([1.0]), 1.0)] * 10,
> ["features", "label"])
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
> 0.0001]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, 
> evaluator=evaluator)
> cvModel = cv.fit(dataset)
> {noformat}
> I can get the regression coefficient out, but I can't get the regularization 
> parameter
> {noformat}
> In [3]: cvModel.bestModel.coefficients
> Out[3]: DenseVector([3.1573])
> In [4]: cvModel.bestModel.explainParams()
> Out[4]: ''
> In [5]: cvModel.bestModel.extractParamMap()
> Out[5]: {}
> In [15]: cvModel.params
> Out[15]: []
> In [36]: cvModel.bestModel.params
> Out[36]: []
> {noformat}
> For the original issue raised on StackOverflow please see 
> http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters

2016-04-19 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-14740:
-
Description: 
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

{noformat}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
{noformat}

I can get the regression coefficient out, but I can't get the regularization 
parameter

{noformat}
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
{noformat}

For a simple example please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark



  was:
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

{{
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
}}

I can get the regression coefficient out, but I can't get the regularization 
parameter

{{
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
}}

For a simple example please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark




> CrossValidatorModel.bestModel does not include hyper-parameters
> ---
>
> Key: SPARK-14740
> URL: https://issues.apache.org/jira/browse/SPARK-14740
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Paul Shearer
>
> If you tune hyperparameters using a CrossValidator object in PySpark, you may 
> not be able to extract the parameter values of the best model.
> {noformat}
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
> dataset = sqlContext.createDataFrame(
> [(Vectors.dense([0.0]), 0.0),
>  (Vectors.dense([0.4]), 1.0),
>  (Vectors.dense([0.5]), 0.0),
>  (Vectors.dense([0.6]), 1.0),
>  (Vectors.dense([1.0]), 1.0)] * 10,
> ["features", "label"])
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
> 0.0001]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, 
> evaluator=evaluator)
> cvModel = cv.fit(dataset)
> {noformat}
> I can get the regression coefficient out, but I can't get the regularization 
> parameter
> {noformat}
> In [3]: cvModel.bestModel.coefficients
> Out[3]: DenseVector([3.1573])
> In [4]: cvModel.bestModel.explainParams()
> Out[4]: ''
> In [5]: cvModel.bestModel.extractParamMap()
> Out[5]: {}
> In [15]: cvModel.params
> Out[15]: []
> In [36]: cvModel.bestModel.params
> Out[36]: []
> {noformat}
> For a simple example please see 
> http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters

2016-04-19 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-14740:
-
Description: 
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

{noformat}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
{noformat}

I can get the regression coefficient out, but I can't get the regularization 
parameter

{noformat}
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
{noformat}

For the original issue on StackOverflow please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark



  was:
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

```
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
```

I can get the regression coefficient out, but I can't get the regularization 
parameter

```
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
```

For a simple example please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark




> CrossValidatorModel.bestModel does not include hyper-parameters
> ---
>
> Key: SPARK-14740
> URL: https://issues.apache.org/jira/browse/SPARK-14740
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Paul Shearer
>
> If you tune hyperparameters using a CrossValidator object in PySpark, you may 
> not be able to extract the parameter values of the best model.
> {noformat}
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
> dataset = sqlContext.createDataFrame(
> [(Vectors.dense([0.0]), 0.0),
>  (Vectors.dense([0.4]), 1.0),
>  (Vectors.dense([0.5]), 0.0),
>  (Vectors.dense([0.6]), 1.0),
>  (Vectors.dense([1.0]), 1.0)] * 10,
> ["features", "label"])
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
> 0.0001]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, 
> evaluator=evaluator)
> cvModel = cv.fit(dataset)
> {noformat}
> I can get the regression coefficient out, but I can't get the regularization 
> parameter
> {noformat}
> In [3]: cvModel.bestModel.coefficients
> Out[3]: DenseVector([3.1573])
> In [4]: cvModel.bestModel.explainParams()
> Out[4]: ''
> In [5]: cvModel.bestModel.extractParamMap()
> Out[5]: {}
> In [15]: cvModel.params
> Out[15]: []
> In [36]: cvModel.bestModel.params
> Out[36]: []
> {noformat}
> For the original issue on StackOverflow please see 
> http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters

2016-04-19 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-14740:
-
Description: 
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

{{
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
}}

I can get the regression coefficient out, but I can't get the regularization 
parameter

{{
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
}}

For a simple example please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark



  was:
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

{noformat}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
{noformat}

I can get the regression coefficient out, but I can't get the regularization 
parameter

{noformat}
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
{noformat}

For the original issue on StackOverflow please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark




> CrossValidatorModel.bestModel does not include hyper-parameters
> ---
>
> Key: SPARK-14740
> URL: https://issues.apache.org/jira/browse/SPARK-14740
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Paul Shearer
>
> If you tune hyperparameters using a CrossValidator object in PySpark, you may 
> not be able to extract the parameter values of the best model.
> {{
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
> dataset = sqlContext.createDataFrame(
> [(Vectors.dense([0.0]), 0.0),
>  (Vectors.dense([0.4]), 1.0),
>  (Vectors.dense([0.5]), 0.0),
>  (Vectors.dense([0.6]), 1.0),
>  (Vectors.dense([1.0]), 1.0)] * 10,
> ["features", "label"])
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
> 0.0001]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, 
> evaluator=evaluator)
> cvModel = cv.fit(dataset)
> }}
> I can get the regression coefficient out, but I can't get the regularization 
> parameter
> {{
> In [3]: cvModel.bestModel.coefficients
> Out[3]: DenseVector([3.1573])
> In [4]: cvModel.bestModel.explainParams()
> Out[4]: ''
> In [5]: cvModel.bestModel.extractParamMap()
> Out[5]: {}
> In [15]: cvModel.params
> Out[15]: []
> In [36]: cvModel.bestModel.params
> Out[36]: []
> }}
> For a simple example please see 
> http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters

2016-04-19 Thread Paul Shearer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Shearer updated SPARK-14740:
-
Description: 
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

```
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
```

I can get the regression coefficient out, but I can't get the regularization 
parameter

```
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
```

For a simple example please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark



  was:
If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

`
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
`

I can get the regression coefficient out, but I can't get the regularization 
parameter

`
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
`

For a simple example please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark




> CrossValidatorModel.bestModel does not include hyper-parameters
> ---
>
> Key: SPARK-14740
> URL: https://issues.apache.org/jira/browse/SPARK-14740
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Paul Shearer
>
> If you tune hyperparameters using a CrossValidator object in PySpark, you may 
> not be able to extract the parameter values of the best model.
> ```
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
> dataset = sqlContext.createDataFrame(
> [(Vectors.dense([0.0]), 0.0),
>  (Vectors.dense([0.4]), 1.0),
>  (Vectors.dense([0.5]), 0.0),
>  (Vectors.dense([0.6]), 1.0),
>  (Vectors.dense([1.0]), 1.0)] * 10,
> ["features", "label"])
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
> 0.0001]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, 
> evaluator=evaluator)
> cvModel = cv.fit(dataset)
> ```
> I can get the regression coefficient out, but I can't get the regularization 
> parameter
> ```
> In [3]: cvModel.bestModel.coefficients
> Out[3]: DenseVector([3.1573])
> In [4]: cvModel.bestModel.explainParams()
> Out[4]: ''
> In [5]: cvModel.bestModel.extractParamMap()
> Out[5]: {}
> In [15]: cvModel.params
> Out[15]: []
> In [36]: cvModel.bestModel.params
> Out[36]: []
> ```
> For a simple example please see 
> http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters

2016-04-19 Thread Paul Shearer (JIRA)
Paul Shearer created SPARK-14740:


 Summary: CrossValidatorModel.bestModel does not include 
hyper-parameters
 Key: SPARK-14740
 URL: https://issues.apache.org/jira/browse/SPARK-14740
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.1
Reporter: Paul Shearer


If you tune hyperparameters using a CrossValidator object in PySpark, you may 
not be able to extract the parameter values of the best model.

`
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
 (Vectors.dense([0.4]), 1.0),
 (Vectors.dense([0.5]), 0.0),
 (Vectors.dense([0.6]), 1.0),
 (Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 
0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
`

I can get the regression coefficient out, but I can't get the regularization 
parameter

`
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
`

For a simple example please see 
http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices

2016-04-19 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249053#comment-15249053
 ] 

Maciej Szymkiewicz edited comment on SPARK-14739 at 4/20/16 12:47 AM:
--

Sure, but your latest PR still doesn't resolve the problem with dead tests. Instead 
of copying, you could actually pull the changes from my repo.




was (Author: zero323):
Sure, but your latest PR still doesn't resolve the problem with dead tests.


> Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with 
> no indices
> ---
>
> Key: SPARK-14739
> URL: https://issues.apache.org/jira/browse/SPARK-14739
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>
> DenseVector:
> {code}
> Vectors.parse(str(Vectors.dense([])))
> ## ValueErrorTraceback (most recent call last)
> ## .. 
> ## ValueError: Unable to parse values from
> {code}
> SparseVector:
> {code}
> Vectors.parse(str(Vectors.sparse(5, [], [])))
> ## ValueErrorTraceback (most recent call last)
> ##  ... 
> ## ValueError: Unable to parse indices from .
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices

2016-04-19 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249053#comment-15249053
 ] 

Maciej Szymkiewicz commented on SPARK-14739:


Sure, but your latest PR still doesn't resolve the problem with dead tests.


> Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with 
> no indices
> ---
>
> Key: SPARK-14739
> URL: https://issues.apache.org/jira/browse/SPARK-14739
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>
> DenseVector:
> {code}
> Vectors.parse(str(Vectors.dense([])))
> ## ValueErrorTraceback (most recent call last)
> ## .. 
> ## ValueError: Unable to parse values from
> {code}
> SparseVector:
> {code}
> Vectors.parse(str(Vectors.sparse(5, [], [])))
> ## ValueErrorTraceback (most recent call last)
> ##  ... 
> ## ValueError: Unable to parse indices from .
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13929) Use Scala reflection for UDFs

2016-04-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13929.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12149
[https://github.com/apache/spark/pull/12149]

> Use Scala reflection for UDFs
> -
>
> Key: SPARK-13929
> URL: https://issues.apache.org/jira/browse/SPARK-13929
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jakob Odersky
>Priority: Minor
> Fix For: 2.0.0
>
>
> {{ScalaReflection}} uses native Java reflection for User Defined Types, which 
> fails if such types are not plain Scala classes that map 1:1 to Java.
> Consider the following extract (from here 
> https://github.com/apache/spark/blob/92024797a4fad594b5314f3f3be5c6be2434de8a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L376
>  ):
> {code}
> case t if Utils.classIsLoadable(className) &&
> Utils.classForName(className).isAnnotationPresent(classOf[SQLUserDefinedType])
>  =>
> val udt = 
> Utils.classForName(className).getAnnotation(classOf[SQLUserDefinedType]).udt().newInstance()
> //...
> {code}
> If {{t}}'s runtime class is actually synthetic (something that doesn't exist 
> in Java and hence uses a dollar sign internally), such as nested classes or 
> package objects, the above code will fail.
> Currently there are no known use cases of synthetic user-defined types (hence 
> the minor priority); however, it would be best practice to remove plain Java 
> reflection and rely on Scala reflection instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249036#comment-15249036
 ] 

Apache Spark commented on SPARK-14739:
--

User 'vishnu667' has created a pull request for this issue:
https://github.com/apache/spark/pull/12513

> Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with 
> no indices
> ---
>
> Key: SPARK-14739
> URL: https://issues.apache.org/jira/browse/SPARK-14739
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>
> DenseVector:
> {code}
> Vectors.parse(str(Vectors.dense([])))
> ## ValueErrorTraceback (most recent call last)
> ## .. 
> ## ValueError: Unable to parse values from
> {code}
> SparseVector:
> {code}
> Vectors.parse(str(Vectors.sparse(5, [], [])))
> ## ValueErrorTraceback (most recent call last)
> ##  ... 
> ## ValueError: Unable to parse indices from .
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14735) PySpark HashingTF hashAlgorithm param + docs

2016-04-19 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249034#comment-15249034
 ] 

zhengruifeng commented on SPARK-14735:
--

I can work on this after SPARK-10574 is resolved.

> PySpark HashingTF hashAlgorithm param + docs
> 
>
> Key: SPARK-14735
> URL: https://issues.apache.org/jira/browse/SPARK-14735
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, PySpark
>Reporter: Joseph K. Bradley
>
> Add hashAlgorithm param to HashingTF in PySpark, and update docs to indicate 
> that the default algorithm is MurmurHash3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14407) Hide HadoopFsRelation related data source API to execution package

2016-04-19 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-14407.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12361
[https://github.com/apache/spark/pull/12361]

> Hide HadoopFsRelation related data source API to execution package
> --
>
> Key: SPARK-14407
> URL: https://issues.apache.org/jira/browse/SPARK-14407
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS Query27

2016-04-19 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249012#comment-15249012
 ] 

Rajesh Balamohan edited comment on SPARK-14521 at 4/20/16 12:31 AM:


Update:
- By default, the spark-thrift server disables "spark.kryo.referenceTracking" (if 
it is not specified in the conf).
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala#L55

- When "spark.kryo.referenceTracking" is set to true explicitly in 
spark-defaults.conf, the query executes successfully. Alternatively, 
"spark.sql.autoBroadcastJoinThreshold" can be set to a very low value to 
prevent broadcasting (this was done just for verification).

- Recent changes to LongHashedRelation could have introduced loops, which would 
require "spark.kryo.referenceTracking=true" in the spark-thrift server. I will 
create a PR for this.


was (Author: rajesh.balamohan):
Update:
- By default, the spark-thrift server disables "spark.kryo.referenceTracking".
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala#L55

- When "spark.kryo.referenceTracking" is set to true explicitly in 
spark-defaults.conf, the query executes successfully. Alternatively, 
"spark.sql.autoBroadcastJoinThreshold" can be set to a very low value to 
prevent broadcasting (this was done just for verification).

- Recent changes to LongHashedRelation could have introduced loops, which would 
require "spark.kryo.referenceTracking=true" in the spark-thrift server. I will 
create a PR for this.

> StackOverflowError in Kryo when executing TPC-DS Query27
> 
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Rajesh Balamohan
>Priority: Blocker
>
> Build details:  Spark build from master branch (Apr-10)
> Dataset: TPC-DS at 200 GB scale in Parquet format, stored in Hive.
> Client: $SPARK_HOME/bin/beeline 
> Query:  TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyways)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> 

[jira] [Resolved] (SPARK-14717) Scala, Python APIs for Dataset.unpersist differ in default blocking value

2016-04-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14717.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12507
[https://github.com/apache/spark/pull/12507]

> Scala, Python APIs for Dataset.unpersist differ in default blocking value
> -
>
> Key: SPARK-14717
> URL: https://issues.apache.org/jira/browse/SPARK-14717
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.0.0
>
>
> In Scala/Java {{Dataset.unpersist()}} sets blocking = false by default, but 
> in Python, it is set to True by default.  We should presumably make them 
> consistent.
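
A small sketch of the asymmetry from the Python side (df stands for any cached 
DataFrame; the keyword name is the one already used by the PySpark signature):

{code}
df.cache()
# The current Python default is blocking=True; passing the flag explicitly gives
# the same non-blocking behaviour as the Scala/Java default:
df.unpersist(blocking=False)
{code}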



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices

2016-04-19 Thread Arash Parsa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249028#comment-15249028
 ] 

Arash Parsa commented on SPARK-14739:
-

[~zero323] Sure, I can adjust my PR (move the tests), but since I found the bug, 
do you think I should be getting the fix in?

> Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with 
> no indices
> ---
>
> Key: SPARK-14739
> URL: https://issues.apache.org/jira/browse/SPARK-14739
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>
> DenseVector:
> {code}
> Vectors.parse(str(Vectors.dense([])))
> ## ValueErrorTraceback (most recent call last)
> ## .. 
> ## ValueError: Unable to parse values from
> {code}
> SparseVector:
> {code}
> Vectors.parse(str(Vectors.sparse(5, [], [])))
> ## ValueErrorTraceback (most recent call last)
> ##  ... 
> ## ValueError: Unable to parse indices from .
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249020#comment-15249020
 ] 

Apache Spark commented on SPARK-14739:
--

User 'vishnu667' has created a pull request for this issue:
https://github.com/apache/spark/pull/12512

> Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with 
> no indices
> ---
>
> Key: SPARK-14739
> URL: https://issues.apache.org/jira/browse/SPARK-14739
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>
> DenseVector:
> {code}
> Vectors.parse(str(Vectors.dense([])))
> ## ValueErrorTraceback (most recent call last)
> ## .. 
> ## ValueError: Unable to parse values from
> {code}
> SparseVector:
> {code}
> Vectors.parse(str(Vectors.sparse(5, [], [])))
> ## ValueErrorTraceback (most recent call last)
> ##  ... 
> ## ValueError: Unable to parse indices from .
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS Query27

2016-04-19 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249012#comment-15249012
 ] 

Rajesh Balamohan commented on SPARK-14521:
--

Update:
- By default, the spark-thrift server disables "spark.kryo.referenceTracking".
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala#L55

- When "spark.kryo.referenceTracking" is set to true explicitly in 
spark-defaults.conf, the query executes successfully. Alternatively, 
"spark.sql.autoBroadcastJoinThreshold" can be set to a very low value to 
prevent broadcasting (this was done just for verification).

- Recent changes to LongHashedRelation could have introduced loops, which would 
require "spark.kryo.referenceTracking=true" in the spark-thrift server. I will 
create a PR for this.
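
For reference, the settings described above as they would look in 
spark-defaults.conf (values are illustrative; -1 is the documented way to disable 
broadcast joins entirely while verifying):

{noformat}
# spark-defaults.conf
spark.kryo.referenceTracking           true
# optional, only to rule out the broadcast path during verification:
spark.sql.autoBroadcastJoinThreshold   -1
{noformat}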

> StackOverflowError in Kryo when executing TPC-DS Query27
> 
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Rajesh Balamohan
>Priority: Blocker
>
> Build details:  Spark build from master branch (Apr-10)
> Dataset: TPC-DS at 200 GB scale in Parquet format, stored in Hive.
> Client: $SPARK_HOME/bin/beeline 
> Query:  TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyways)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-14639) Add `bround` function in Python/R.

2016-04-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-14639:
--
Component/s: SparkR
 PySpark

> Add `bround` function in Python/R.
> --
>
> Key: SPARK-14639
> URL: https://issues.apache.org/jira/browse/SPARK-14639
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR
>Reporter: Dongjoon Hyun
>
> This issue aims to expose the Scala `bround` function in the Python/R API.
> The `bround` function is implemented in SPARK-14614 by extending the current 
> `round` function.
> We used the following semantics from 
> [Hive|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/RoundUtils.java].
> {code}
> public static double bround(double input, int scale) {
> if (Double.isNaN(input) || Double.isInfinite(input)) {
>   return input;
> }
> return BigDecimal.valueOf(input).setScale(scale, 
> RoundingMode.HALF_EVEN).doubleValue();
> }
> {code}
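
A sketch of how the exposed function would presumably be called from PySpark once 
this lands (the signature is assumed to mirror the existing `round`):

{code}
from pyspark.sql.functions import bround  # added by this issue

df = sqlContext.createDataFrame([(2.5,), (3.5,)], ["value"])
# HALF_EVEN ("banker's") rounding: 2.5 -> 2.0 and 3.5 -> 4.0
df.select(bround(df["value"], 0)).show()
{code}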



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices

2016-04-19 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248994#comment-15248994
 ] 

Maciej Szymkiewicz commented on SPARK-14739:


This solves only a small part of the problem. Right now both sparse and dense 
vector parsing are broken, not to mention that the corresponding tests are dead code.

> Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with 
> no indices
> ---
>
> Key: SPARK-14739
> URL: https://issues.apache.org/jira/browse/SPARK-14739
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>
> DenseVector:
> {code}
> Vectors.parse(str(Vectors.dense([])))
> ## ValueErrorTraceback (most recent call last)
> ## .. 
> ## ValueError: Unable to parse values from
> {code}
> SparseVector:
> {code}
> Vectors.parse(str(Vectors.sparse(5, [], [])))
> ## ValueErrorTraceback (most recent call last)
> ##  ... 
> ## ValueError: Unable to parse indices from .
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248995#comment-15248995
 ] 

Apache Spark commented on SPARK-14739:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/12511

> Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with 
> no indices
> ---
>
> Key: SPARK-14739
> URL: https://issues.apache.org/jira/browse/SPARK-14739
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>
> DenseVector:
> {code}
> Vectors.parse(str(Vectors.dense([])))
> ## ValueErrorTraceback (most recent call last)
> ## .. 
> ## ValueError: Unable to parse values from
> {code}
> SparseVector:
> {code}
> Vectors.parse(str(Vectors.sparse(5, [], [])))
> ## ValueErrorTraceback (most recent call last)
> ##  ... 
> ## ValueError: Unable to parse indices from .
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices

2016-04-19 Thread Arash Parsa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248973#comment-15248973
 ] 

Arash Parsa commented on SPARK-14739:
-

Thanks for posting the ticket on JIRA. I created the PR here:
https://github.com/apache/spark/pull/12510


> Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with 
> no indices
> ---
>
> Key: SPARK-14739
> URL: https://issues.apache.org/jira/browse/SPARK-14739
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>
> DenseVector:
> {code}
> Vectors.parse(str(Vectors.dense([])))
> ## ValueErrorTraceback (most recent call last)
> ## .. 
> ## ValueError: Unable to parse values from
> {code}
> SparseVector:
> {code}
> Vectors.parse(str(Vectors.sparse(5, [], [])))
> ## ValueErrorTraceback (most recent call last)
> ##  ... 
> ## ValueError: Unable to parse indices from .
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14739:


Assignee: (was: Apache Spark)

> Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with 
> no indices
> ---
>
> Key: SPARK-14739
> URL: https://issues.apache.org/jira/browse/SPARK-14739
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>
> DenseVector:
> {code}
> Vectors.parse(str(Vectors.dense([])))
> ## ValueErrorTraceback (most recent call last)
> ## .. 
> ## ValueError: Unable to parse values from
> {code}
> SparseVector:
> {code}
> Vectors.parse(str(Vectors.sparse(5, [], [])))
> ## ValueErrorTraceback (most recent call last)
> ##  ... 
> ## ValueError: Unable to parse indices from .
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248972#comment-15248972
 ] 

Apache Spark commented on SPARK-14739:
--

User 'arashpa' has created a pull request for this issue:
https://github.com/apache/spark/pull/12510

> Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with 
> no indices
> ---
>
> Key: SPARK-14739
> URL: https://issues.apache.org/jira/browse/SPARK-14739
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>
> DenseVector:
> {code}
> Vectors.parse(str(Vectors.dense([])))
> ## ValueErrorTraceback (most recent call last)
> ## .. 
> ## ValueError: Unable to parse values from
> {code}
> SparseVector:
> {code}
> Vectors.parse(str(Vectors.sparse(5, [], [])))
> ## ValueErrorTraceback (most recent call last)
> ##  ... 
> ## ValueError: Unable to parse indices from .
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14739:


Assignee: Apache Spark

> Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with 
> no indices
> ---
>
> Key: SPARK-14739
> URL: https://issues.apache.org/jira/browse/SPARK-14739
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>
> DenseVector:
> {code}
> Vectors.parse(str(Vectors.dense([])))
> ## ValueErrorTraceback (most recent call last)
> ## .. 
> ## ValueError: Unable to parse values from
> {code}
> SparseVector:
> {code}
> Vectors.parse(str(Vectors.sparse(5, [], [])))
> ## ValueErrorTraceback (most recent call last)
> ##  ... 
> ## ValueError: Unable to parse indices from .
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14719) WriteAheadLogBasedBlockHandler should ignore BlockManager put errors

2016-04-19 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-14719:
---
Target Version/s:   (was: 2.0.0)

> WriteAheadLogBasedBlockHandler should ignore BlockManager put errors
> 
>
> Key: SPARK-14719
> URL: https://issues.apache.org/jira/browse/SPARK-14719
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> {{WriteAheadLogBasedBlockHandler}} will currently throw exceptions if 
> BlockManager puts fail, even though those puts are only performed as a 
> performance optimization. Instead, it should log and ignore exceptions 
> originating from the block manager put.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14719) WriteAheadLogBasedBlockHandler should ignore BlockManager put errors

2016-04-19 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-14719.

Resolution: Won't Fix

> WriteAheadLogBasedBlockHandler should ignore BlockManager put errors
> 
>
> Key: SPARK-14719
> URL: https://issues.apache.org/jira/browse/SPARK-14719
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> {{WriteAheadLogBasedBlockHandler}} will currently throw exceptions if 
> BlockManager puts fail, even though those puts are only performed as a 
> performance optimization. Instead, it should log and ignore exceptions 
> originating from the block manager put.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14719) WriteAheadLogBasedBlockHandler should ignore BlockManager put errors

2016-04-19 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-14719:
---
Fix Version/s: (was: 2.0.0)

> WriteAheadLogBasedBlockHandler should ignore BlockManager put errors
> 
>
> Key: SPARK-14719
> URL: https://issues.apache.org/jira/browse/SPARK-14719
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> {{WriteAheadLogBasedBlockHandler}} will currently throw exceptions if 
> BlockManager puts fail, even though those puts are only performed as a 
> performance optimization. Instead, it should log and ignore exceptions 
> originating from the block manager put.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-14719) WriteAheadLogBasedBlockHandler should ignore BlockManager put errors

2016-04-19 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reopened SPARK-14719:


Reverted this patch and am marking it as "Won't Fix" for now. See the discussion on 
the PR for an explanation of why I am reluctant to modify this code now.

> WriteAheadLogBasedBlockHandler should ignore BlockManager put errors
> 
>
> Key: SPARK-14719
> URL: https://issues.apache.org/jira/browse/SPARK-14719
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> {{WriteAheadLogBasedBlockHandler}} will currently throw exceptions if 
> BlockManager puts fail, even though those puts are only performed as a 
> performance optimization. Instead, it should log and ignore exceptions 
> originating from the block manager put.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices

2016-04-19 Thread Maciej Szymkiewicz (JIRA)
Maciej Szymkiewicz created SPARK-14739:
--

 Summary: Vectors.parse doesn't handle dense vectors of size 0 and 
sparse vectors with no indices
 Key: SPARK-14739
 URL: https://issues.apache.org/jira/browse/SPARK-14739
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.6.0, 2.0.0
Reporter: Maciej Szymkiewicz


DenseVector:

{code}
Vectors.parse(str(Vectors.dense([])))
## ValueErrorTraceback (most recent call last)
## .. 
## ValueError: Unable to parse values from
{code}

SparseVector:

{code}
Vectors.parse(str(Vectors.sparse(5, [], [])))
## ValueErrorTraceback (most recent call last)
##  ... 
## ValueError: Unable to parse indices from .
{code}
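For reference, a small workaround sketch in plain PySpark is shown below; the helper name {{parse_vector}} is made up for illustration and simply special-cases the two empty forms that {{Vectors.parse}} currently rejects.

{code}
from pyspark.mllib.linalg import DenseVector, SparseVector, Vectors

def parse_vector(s):
    """Hypothetical tolerant wrapper around Vectors.parse."""
    s = s.strip()
    if s == "[]":                                   # str(Vectors.dense([]))
        return DenseVector([])
    if s.replace(" ", "").endswith(",[],[])"):      # e.g. "(5,[],[])"
        size = int(s[1:s.index(",")])
        return SparseVector(size, [], [])
    return Vectors.parse(s)                         # stock parser for everything else

print(parse_vector(str(Vectors.dense([]))))           # []
print(parse_vector(str(Vectors.sparse(5, [], []))))   # (5,[],[])
{code}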



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14571) Log instrumentation in ALS

2016-04-19 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248859#comment-15248859
 ] 

Miao Wang commented on SPARK-14571:
---

Since nobody has taken this one, can I learn it and give it a try?

Thanks!

Miao

> Log instrumentation in ALS
> --
>
> Key: SPARK-14571
> URL: https://issues.apache.org/jira/browse/SPARK-14571
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Timothy Hunter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12224) R support for JDBC source

2016-04-19 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-12224.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10480
[https://github.com/apache/spark/pull/10480]

> R support for JDBC source
> -
>
> Key: SPARK-12224
> URL: https://issues.apache.org/jira/browse/SPARK-12224
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12224) R support for JDBC source

2016-04-19 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12224:
--
Assignee: Felix Cheung  (was: Apache Spark)

> R support for JDBC source
> -
>
> Key: SPARK-12224
> URL: https://issues.apache.org/jira/browse/SPARK-12224
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14733) Allow custom timing control in microbenchmarks

2016-04-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14733.
-
   Resolution: Fixed
 Assignee: Eric Liang
Fix Version/s: 2.0.0

> Allow custom timing control in microbenchmarks
> --
>
> Key: SPARK-14733
> URL: https://issues.apache.org/jira/browse/SPARK-14733
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.0.0
>
>
> The current benchmark framework runs a code block for several iterations and 
> reports statistics. However there is no way to exclude per-iteration setup 
> time from the overall results.
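The general idea, illustrated below with a plain-Python sketch (not the actual Benchmark API), is to hand each iteration an explicit timer so that per-iteration setup can be left out of the measured time.

{code}
import time

class IterTimer(object):
    def __init__(self):
        self.elapsed = 0.0
        self._t0 = None
    def start(self):
        self._t0 = time.time()
    def stop(self):
        self.elapsed += time.time() - self._t0

def run_case(body, iters=5):
    """body(timer) decides which part of each iteration is actually timed."""
    results = []
    for _ in range(iters):
        t = IterTimer()
        body(t)
        results.append(t.elapsed)
    return min(results), sum(results) / len(results)

def case(timer):
    data = list(range(1000000))   # per-iteration setup, excluded from timing
    timer.start()
    total = sum(data)             # the code under measurement
    timer.stop()

print(run_case(case))             # (best, average) over the timed region only
{code}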



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14639) Add `bround` function in Python/R.

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248848#comment-15248848
 ] 

Apache Spark commented on SPARK-14639:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/12509

> Add `bround` function in Python/R.
> --
>
> Key: SPARK-14639
> URL: https://issues.apache.org/jira/browse/SPARK-14639
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dongjoon Hyun
>
> This issue aims to expose Scala `bround` function in Python/R API.
> `bround` function is implemented in SPARK-14614 by extending current `round` 
> function.
> We used the following semantics from 
> [Hive|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/RoundUtils.java].
> {code}
> public static double bround(double input, int scale) {
>   if (Double.isNaN(input) || Double.isInfinite(input)) {
>     return input;
>   }
>   return BigDecimal.valueOf(input).setScale(scale, RoundingMode.HALF_EVEN).doubleValue();
> }
> {code}
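For readers unfamiliar with HALF_EVEN (banker's) rounding, a rough Python equivalent of the semantics above is sketched below; it is for illustration only and is not the PySpark implementation.

{code}
import math
from decimal import Decimal, ROUND_HALF_EVEN

def bround(value, scale=0):
    if math.isnan(value) or math.isinf(value):
        return value
    quantum = Decimal(1).scaleb(-scale)   # scale=0 -> 1, scale=1 -> 0.1, ...
    return float(Decimal(repr(value)).quantize(quantum, rounding=ROUND_HALF_EVEN))

print(bround(2.5), bround(3.5))   # 2.0 4.0 -- ties go to the even neighbour
print(bround(1.25, 1))            # 1.2
{code}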



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14639) Add `bround` function in Python/R.

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14639:


Assignee: (was: Apache Spark)

> Add `bround` function in Python/R.
> --
>
> Key: SPARK-14639
> URL: https://issues.apache.org/jira/browse/SPARK-14639
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dongjoon Hyun
>
> This issue aims to expose Scala `bround` function in Python/R API.
> `bround` function is implemented in SPARK-14614 by extending current `round` 
> function.
> We used the following semantics from 
> [Hive|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/RoundUtils.java].
> {code}
> public static double bround(double input, int scale) {
>   if (Double.isNaN(input) || Double.isInfinite(input)) {
>     return input;
>   }
>   return BigDecimal.valueOf(input).setScale(scale, RoundingMode.HALF_EVEN).doubleValue();
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14639) Add `bround` function in Python/R.

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14639:


Assignee: Apache Spark

> Add `bround` function in Python/R.
> --
>
> Key: SPARK-14639
> URL: https://issues.apache.org/jira/browse/SPARK-14639
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> This issue aims to expose Scala `bround` function in Python/R API.
> `bround` function is implemented in SPARK-14614 by extending current `round` 
> function.
> We used the following semantics from 
> [Hive|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/RoundUtils.java].
> {code}
> public static double bround(double input, int scale) {
>   if (Double.isNaN(input) || Double.isInfinite(input)) {
>     return input;
>   }
>   return BigDecimal.valueOf(input).setScale(scale, RoundingMode.HALF_EVEN).doubleValue();
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14478) Should StandardScaler use biased variance to scale?

2016-04-19 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248825#comment-15248825
 ] 

Joseph K. Bradley commented on SPARK-14478:
---

Adding a param seems reasonable, though probably pretty low priority.  To make 
a judgement call...how about we leave it as is for now?  I'll send a PR to 
document that it's using unbiased variance.  If any user ever needs biased, 
then we can add the Param (but I've never heard anyone except myself complain).

> Should StandardScaler use biased variance to scale?
> ---
>
> Key: SPARK-14478
> URL: https://issues.apache.org/jira/browse/SPARK-14478
> Project: Spark
>  Issue Type: Question
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>
> Currently, MLlib's StandardScaler scales columns using the unbiased standard 
> deviation.  This matches what R's scale package does.
> However, it is a bit odd for 2 reasons:
> * Optimization/ML algorithms which require scaled columns generally assume 
> unit variance (for mathematical convenience).  That requires using biased 
> variance.
> * scikit-learn, MLlib's GLMs, and R's glmnet package all use biased variance.
> *Question*: Should we switch to biased?
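For concreteness, the two conventions differ only in the divisor (n-1 vs. n); a tiny NumPy illustration, independent of MLlib:

{code}
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
unbiased_var = x.var(ddof=1)   # 1.667 -- divide by n-1 (current StandardScaler / R scale())
biased_var = x.var(ddof=0)     # 1.25  -- divide by n   (scikit-learn, glmnet)
print(np.sqrt(unbiased_var), np.sqrt(biased_var))   # the two candidate scaling factors
{code}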



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test

2016-04-19 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248822#comment-15248822
 ] 

Joseph K. Bradley commented on SPARK-8884:
--

I'm not sure this will make 2.0, so I'm changing the target to 2.1.  [~mengxr] 
please retarget if needed.

> 1-sample Anderson-Darling Goodness-of-Fit test
> --
>
> Key: SPARK-8884
> URL: https://issues.apache.org/jira/browse/SPARK-8884
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Jose Cambronero
>
> We have implemented a 1-sample Anderson-Darling goodness-of-fit test to add 
> to the current hypothesis testing functionality. The current implementation 
> supports various distributions (normal, exponential, gumbel, logistic, and 
> weibull). However, users must provide distribution parameters for all except 
> normal/exponential (in which case they are estimated from the data). In 
> contrast to other tests, such as the Kolmogorov Smirnov test, we only support 
> specific distributions as the critical values depend on the distribution 
> being tested. 
> The distributed implementation of AD takes advantage of the fact that we can 
> calculate a portion of the statistic within each partition of a sorted data 
> set, independent of the global order of those observations. We can then carry 
> some additional information that allows us to adjust the final amounts once 
> we have collected 1 result per partition.
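A toy, single-machine sketch of that per-partition trick is shown below (plain Python against the standard normal CDF; the function names and the partitioning are made up, and this is not the proposed MLlib code). Each partition computes sums that do not depend on its global position, and the combine step re-weights them once the per-partition counts are known.

{code}
import math

def phi(x):                         # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def partition_sums(part):           # per-partition work; `part` is locally sorted
    s_f = s_jf = s_g = s_jg = 0.0
    for j, x in enumerate(part, start=1):
        lf, lg = math.log(phi(x)), math.log(1.0 - phi(x))
        s_f += lf
        s_jf += (2 * j - 1) * lf
        s_g += lg
        s_jg += (2 * j - 1) * lg
    return len(part), s_f, s_jf, s_g, s_jg

def anderson_darling(partials):     # driver-side combine using partition counts
    n = sum(p[0] for p in partials)
    total, offset = 0.0, 0
    for cnt, s_f, s_jf, s_g, s_jg in partials:
        # global index i = offset + j, so (2i-1) = 2*offset + (2j-1)
        # and (2(n-i)+1) = 2*(n-offset) - (2j-1)
        total += 2 * offset * s_f + s_jf + 2 * (n - offset) * s_g - s_jg
        offset += cnt
    return -n - total / n

data = sorted([-1.2, -0.7, -0.5, 0.0, 0.1, 0.3, 0.9, 2.1])
parts = [data[:3], data[3:6], data[6:]]   # stand-ins for RDD partitions
print(anderson_darling([partition_sums(p) for p in parts]))
{code}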



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test

2016-04-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8884:
-
Target Version/s: 2.1.0  (was: 2.0.0)

> 1-sample Anderson-Darling Goodness-of-Fit test
> --
>
> Key: SPARK-8884
> URL: https://issues.apache.org/jira/browse/SPARK-8884
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Jose Cambronero
>
> We have implemented a 1-sample Anderson-Darling goodness-of-fit test to add 
> to the current hypothesis testing functionality. The current implementation 
> supports various distributions (normal, exponential, gumbel, logistic, and 
> weibull). However, users must provide distribution parameters for all except 
> normal/exponential (in which case they are estimated from the data). In 
> contrast to other tests, such as the Kolmogorov Smirnov test, we only support 
> specific distributions as the critical values depend on the distribution 
> being tested. 
> The distributed implementation of AD takes advantage of the fact that we can 
> calculate a portion of the statistic within each partition of a sorted data 
> set, independent of the global order of those observations. We can then carry 
> some additional information that allows us to adjust the final amounts once 
> we have collected 1 result per partition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10388) Public dataset loader interface

2016-04-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10388:
--
Target Version/s: 2.1.0  (was: 2.0.0)

> Public dataset loader interface
> ---
>
> Key: SPARK-10388
> URL: https://issues.apache.org/jira/browse/SPARK-10388
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
> Attachments: SPARK-10388PublicDataSetLoaderInterface.pdf
>
>
> It is very useful to have a public dataset loader to fetch ML datasets from 
> popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, 
> requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> User should be able to list (or preview) datasets, e.g.
> {code}
> val datasets = loader.ls("libsvm") // returns a local DataFrame
> datasets.show() // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the 
> API and implementation are pending discussion. Note that this requires http 
> and https support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4226) SparkSQL - Add support for subqueries in predicates

2016-04-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-4226.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12306
[https://github.com/apache/spark/pull/12306]

> SparkSQL - Add support for subqueries in predicates
> ---
>
> Key: SPARK-4226
> URL: https://issues.apache.org/jira/browse/SPARK-4226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: Spark 1.2 snapshot
>Reporter: Terry Siu
> Fix For: 2.0.0
>
>
> I have a test table defined in Hive as follows:
> {code:sql}
> CREATE TABLE sparkbug (
>   id INT,
>   event STRING
> ) STORED AS PARQUET;
> {code}
> and insert some sample data with ids 1, 2, 3.
> In a Spark shell, I then create a HiveContext and then execute the following 
> HQL to test out subquery predicates:
> {code}
> val hc = new HiveContext(sc)
> hc.hql("select customerid from sparkbug where customerid in (select 
> customerid from sparkbug where customerid in (2,3))")
> {code}
> I get the following error:
> {noformat}
> java.lang.RuntimeException: Unsupported language features in query: select 
> customerid from sparkbug where customerid in (select customerid from sparkbug 
> where customerid in (2,3))
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_SUBQUERY_EXPR
> TOK_SUBQUERY_OP
>   in
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
> sparkbug
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_TABLE_OR_COL
>   customerid
> TOK_WHERE
>   TOK_FUNCTION
> in
> TOK_TABLE_OR_COL
>   customerid
> 2
> 3
> TOK_TABLE_OR_COL
>   customerid
> scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR
>   TOK_SUBQUERY_OP
> in
>   TOK_QUERY
> TOK_FROM
>   TOK_TABREF
> TOK_TABNAME
>   sparkbug
> TOK_INSERT
>   TOK_DESTINATION
> TOK_DIR
>   TOK_TMP_FILE
>   TOK_SELECT
> TOK_SELEXPR
>   TOK_TABLE_OR_COL
> customerid
>   TOK_WHERE
> TOK_FUNCTION
>   in
>   TOK_TABLE_OR_COL
> customerid
>   2
>   3
>   TOK_TABLE_OR_COL
> customerid
> " +
>  
> org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
> 
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> {noformat}
> [This 
> thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html]
>  also brings up lack of subquery support in SparkSQL. It would be nice to 
> have subquery predicate support in a near-future release (1.3, maybe?).
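Now that this is resolved for 2.0, a minimal PySpark illustration of an IN-subquery predicate might look like the following (assuming a Spark 2.0 {{SparkSession}} named {{spark}}; the data is made up):

{code}
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["customerid", "event"])
df.createOrReplaceTempView("sparkbug")

spark.sql("""
    SELECT customerid FROM sparkbug
    WHERE customerid IN (SELECT customerid FROM sparkbug WHERE customerid IN (2, 3))
""").show()
{code}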



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14738) Separate Docker Integration Tests from main spark build

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248788#comment-15248788
 ] 

Apache Spark commented on SPARK-14738:
--

User 'lresende' has created a pull request for this issue:
https://github.com/apache/spark/pull/12508

> Separate Docker Integration Tests from main spark build
> ---
>
> Key: SPARK-14738
> URL: https://issues.apache.org/jira/browse/SPARK-14738
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Reporter: Luciano Resende
>
> Currently, Docker integration tests are run as part of the main build, but they 
> require dev machines to have a full Docker installation, which in most cases is 
> not available, so the tests fail.
> This change would separate the tests from the main Spark build and make them 
> optional, so that they could be invoked manually or as part of CI tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14738) Separate Docker Integration Tests from main spark build

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14738:


Assignee: (was: Apache Spark)

> Separate Docker Integration Tests from main spark build
> ---
>
> Key: SPARK-14738
> URL: https://issues.apache.org/jira/browse/SPARK-14738
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Reporter: Luciano Resende
>
> Currently, Docker integration tests are run as part of the main build, but they 
> require dev machines to have a full Docker installation, which in most cases is 
> not available, so the tests fail.
> This change would separate the tests from the main Spark build and make them 
> optional, so that they could be invoked manually or as part of CI tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14738) Separate Docker Integration Tests from main spark build

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14738:


Assignee: Apache Spark

> Separate Docker Integration Tests from main spark build
> ---
>
> Key: SPARK-14738
> URL: https://issues.apache.org/jira/browse/SPARK-14738
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Reporter: Luciano Resende
>Assignee: Apache Spark
>
> Currently, Docker integration tests are run as part of the main build, but they 
> require dev machines to have a full Docker installation, which in most cases is 
> not available, so the tests fail.
> This change would separate the tests from the main Spark build and make them 
> optional, so that they could be invoked manually or as part of CI tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14738) Separate Docker Integration Tests from main spark build

2016-04-19 Thread Luciano Resende (JIRA)
Luciano Resende created SPARK-14738:
---

 Summary: Separate Docker Integration Tests from main spark build
 Key: SPARK-14738
 URL: https://issues.apache.org/jira/browse/SPARK-14738
 Project: Spark
  Issue Type: Bug
  Components: Build, SQL
Reporter: Luciano Resende


Currently, Docker integration tests are run as part of the main build, but they 
require dev machines to have a full Docker installation, which in most cases is not 
available, so the tests fail.

This change would separate the tests from the main Spark build and make them optional, 
so that they could be invoked manually or as part of CI tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14717) Scala, Python APIs for Dataset.unpersist differ in default blocking value

2016-04-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14717:
---
Assignee: Felix Cheung

> Scala, Python APIs for Dataset.unpersist differ in default blocking value
> -
>
> Key: SPARK-14717
> URL: https://issues.apache.org/jira/browse/SPARK-14717
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Minor
>
> In Scala/Java {{Dataset.unpersist()}} sets blocking = false by default, but 
> in Python, it is set to True by default.  We should presumably make them 
> consistent.
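Until the defaults are aligned, passing {{blocking}} explicitly avoids relying on either default; a minimal PySpark sketch, assuming an existing DataFrame {{df}}:

{code}
df.cache()
df.count()                      # materialize the cache
df.unpersist(blocking=False)    # match the Scala/Java default explicitly
{code}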



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14717) Scala, Python APIs for Dataset.unpersist differ in default blocking value

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14717:


Assignee: Apache Spark

> Scala, Python APIs for Dataset.unpersist differ in default blocking value
> -
>
> Key: SPARK-14717
> URL: https://issues.apache.org/jira/browse/SPARK-14717
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> In Scala/Java {{Dataset.unpersist()}} sets blocking = false by default, but 
> in Python, it is set to True by default.  We should presumably make them 
> consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14717) Scala, Python APIs for Dataset.unpersist differ in default blocking value

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14717:


Assignee: (was: Apache Spark)

> Scala, Python APIs for Dataset.unpersist differ in default blocking value
> -
>
> Key: SPARK-14717
> URL: https://issues.apache.org/jira/browse/SPARK-14717
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In Scala/Java {{Dataset.unpersist()}} sets blocking = false by default, but 
> in Python, it is set to True by default.  We should presumably make them 
> consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14717) Scala, Python APIs for Dataset.unpersist differ in default blocking value

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248721#comment-15248721
 ] 

Apache Spark commented on SPARK-14717:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/12507

> Scala, Python APIs for Dataset.unpersist differ in default blocking value
> -
>
> Key: SPARK-14717
> URL: https://issues.apache.org/jira/browse/SPARK-14717
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In Scala/Java {{Dataset.unpersist()}} sets blocking = false by default, but 
> in Python, it is set to True by default.  We should presumably make them 
> consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14042) Add support for custom coalescers

2016-04-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14042.
-
   Resolution: Fixed
 Assignee: Nezih Yigitbasi
Fix Version/s: 2.0.0

> Add support for custom coalescers
> -
>
> Key: SPARK-14042
> URL: https://issues.apache.org/jira/browse/SPARK-14042
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Nezih Yigitbasi
>Assignee: Nezih Yigitbasi
> Fix For: 2.0.0
>
>
> Per our discussion on the mailing list (please see 
> [here|http://mail-archives.apache.org/mod_mbox//spark-dev/201602.mbox/%3CCA+g63F7aVRBH=WyyK3nvBSLCMPtSdUuL_Ge9=ww4dnmnvy4...@mail.gmail.com%3E])
>  it would be nice to specify a custom coalescing policy as the current 
> {{coalesce()}} method only allows the user to specify the number of 
> partitions and we cannot really control much. The need for this feature 
> popped up when I wanted to merge small files by coalescing them by size.
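For the size-based use case specifically, a rough driver-side workaround (not the new custom-coalescer API) is to derive the partition count from an estimated output size; a PySpark sketch with a made-up target size:

{code}
TARGET_PARTITION_BYTES = 128 * 1024 * 1024   # assumed target size per partition

def coalesce_by_size(rdd, estimated_total_bytes):
    num_partitions = max(1, int(estimated_total_bytes // TARGET_PARTITION_BYTES))
    return rdd.coalesce(num_partitions)
{code}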



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14656) Benchmark.getPorcessorName() always return "Unknown processor" on Linux

2016-04-19 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-14656.
---
   Resolution: Fixed
 Assignee: Kazuaki Ishizaki
Fix Version/s: 2.0.0

> Benchmark.getPorcessorName() always return "Unknown processor" on Linux
> ---
>
> Key: SPARK-14656
> URL: https://issues.apache.org/jira/browse/SPARK-14656
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Linux
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Critical
> Fix For: 2.0.0
>
>
> When we call org.apache.spark.util.Benchmark.getPorcessorName() on Linux, it 
> always returns {{"Unknown processor"}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14736) Deadlock in registering applications while the Master is in the RECOVERING mode

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14736:


Assignee: (was: Apache Spark)

> Deadlock in registering applications while the Master is in the RECOVERING 
> mode
> ---
>
> Key: SPARK-14736
> URL: https://issues.apache.org/jira/browse/SPARK-14736
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.5.0, 1.6.0
> Environment: unix, Spark cluster with a custom 
> StandaloneRecoveryModeFactory and a custom PersistenceEngine
>Reporter: niranda perera
>Priority: Critical
>
> I have encountered the following issue in the standalone recovery mode. 
> Let's say there was an application A running in the cluster. Due to some 
> issue, the entire cluster, together with the application A goes down. 
> Then later on, cluster comes back online, and the master then goes into the 
> 'recovering' mode, because it sees some apps, workers and drivers have 
> already been in the cluster from Persistence Engine. While in the recovery 
> process, the application comes back online, but now it would have a different 
> ID, let's say B. 
> But then, as per the master's application registration logic, this application 
> B will NOT be added to 'waitingApps', with the message "Attempted to 
> re-register application at same address". [1]
>   private def registerApplication(app: ApplicationInfo): Unit = {
>     val appAddress = app.driver.address
>     if (addressToApp.contains(appAddress)) {
>       logInfo("Attempted to re-register application at same address: " + appAddress)
>       return
>     }
> The problem here is that the master is trying to recover application A, which is 
> no longer there. Therefore, after the recovery process, app A will be dropped. 
> However, app A's successor, app B, was also omitted from the 'waitingApps' list 
> because it had the same address as app A previously. 
> This creates a deadlock in the cluster: neither app A nor app B is available in 
> the cluster. 
> When the master is in the RECOVERING mode, shouldn't it add all the registering 
> apps to a list first, and then, after the recovery is completed (once the 
> unsuccessful recoveries are removed), deploy the apps which are new?
> This would sort out this deadlock, IMO.
> [1] 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14736) Deadlock in registering applications while the Master is in the RECOVERING mode

2016-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14736:


Assignee: Apache Spark

> Deadlock in registering applications while the Master is in the RECOVERING 
> mode
> ---
>
> Key: SPARK-14736
> URL: https://issues.apache.org/jira/browse/SPARK-14736
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.5.0, 1.6.0
> Environment: unix, Spark cluster with a custom 
> StandaloneRecoveryModeFactory and a custom PersistenceEngine
>Reporter: niranda perera
>Assignee: Apache Spark
>Priority: Critical
>
> I have encountered the following issue in the standalone recovery mode. 
> Let's say there was an application A running in the cluster. Due to some 
> issue, the entire cluster, together with the application A goes down. 
> Then later on, cluster comes back online, and the master then goes into the 
> 'recovering' mode, because it sees some apps, workers and drivers have 
> already been in the cluster from Persistence Engine. While in the recovery 
> process, the application comes back online, but now it would have a different 
> ID, let's say B. 
> But then, as per the master's application registration logic, this application 
> B will NOT be added to 'waitingApps', with the message "Attempted to 
> re-register application at same address". [1]
>   private def registerApplication(app: ApplicationInfo): Unit = {
>     val appAddress = app.driver.address
>     if (addressToApp.contains(appAddress)) {
>       logInfo("Attempted to re-register application at same address: " + appAddress)
>       return
>     }
> The problem here is that the master is trying to recover application A, which is 
> no longer there. Therefore, after the recovery process, app A will be dropped. 
> However, app A's successor, app B, was also omitted from the 'waitingApps' list 
> because it had the same address as app A previously. 
> This creates a deadlock in the cluster: neither app A nor app B is available in 
> the cluster. 
> When the master is in the RECOVERING mode, shouldn't it add all the registering 
> apps to a list first, and then, after the recovery is completed (once the 
> unsuccessful recoveries are removed), deploy the apps which are new?
> This would sort out this deadlock, IMO.
> [1] 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14736) Deadlock in registering applications while the Master is in the RECOVERING mode

2016-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248693#comment-15248693
 ] 

Apache Spark commented on SPARK-14736:
--

User 'nirandaperera' has created a pull request for this issue:
https://github.com/apache/spark/pull/12506

> Deadlock in registering applications while the Master is in the RECOVERING 
> mode
> ---
>
> Key: SPARK-14736
> URL: https://issues.apache.org/jira/browse/SPARK-14736
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.5.0, 1.6.0
> Environment: unix, Spark cluster with a custom 
> StandaloneRecoveryModeFactory and a custom PersistenceEngine
>Reporter: niranda perera
>Priority: Critical
>
> I have encountered the following issue in the standalone recovery mode. 
> Let's say there was an application A running in the cluster. Due to some 
> issue, the entire cluster, together with the application A goes down. 
> Then later on, cluster comes back online, and the master then goes into the 
> 'recovering' mode, because it sees some apps, workers and drivers have 
> already been in the cluster from Persistence Engine. While in the recovery 
> process, the application comes back online, but now it would have a different 
> ID, let's say B. 
> But then, as per the master's application registration logic, this application 
> B will NOT be added to 'waitingApps', with the message "Attempted to 
> re-register application at same address". [1]
>   private def registerApplication(app: ApplicationInfo): Unit = {
>     val appAddress = app.driver.address
>     if (addressToApp.contains(appAddress)) {
>       logInfo("Attempted to re-register application at same address: " + appAddress)
>       return
>     }
> The problem here is that the master is trying to recover application A, which is 
> no longer there. Therefore, after the recovery process, app A will be dropped. 
> However, app A's successor, app B, was also omitted from the 'waitingApps' list 
> because it had the same address as app A previously. 
> This creates a deadlock in the cluster: neither app A nor app B is available in 
> the cluster. 
> When the master is in the RECOVERING mode, shouldn't it add all the registering 
> apps to a list first, and then, after the recovery is completed (once the 
> unsuccessful recoveries are removed), deploy the apps which are new?
> This would sort out this deadlock, IMO.
> [1] 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14737) Kafka Brokers are down - spark stream should retry

2016-04-19 Thread Faisal (JIRA)
Faisal created SPARK-14737:
--

 Summary: Kafka Brokers are down - spark stream should retry
 Key: SPARK-14737
 URL: https://issues.apache.org/jira/browse/SPARK-14737
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
 Environment: Suse Linux, Cloudera Enterprise 5.4.8 (#7 built by 
jenkins on 20151023-1205 git: d7dbdf29ac1d57ae9fb19958502d50dcf4e4fffd), 
kafka_2.10-0.8.2.2

Reporter: Faisal


I have a Spark Streaming application that uses direct streaming, listening to a 
Kafka topic.
{code}
HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "broker1,broker2,broker3");
kafkaParams.put("auto.offset.reset", "largest");

HashSet<String> topicsSet = new HashSet<String>();
topicsSet.add("Topic1");

JavaPairInputDStream<String, String> messages =
    KafkaUtils.createDirectStream(
        jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams,
        topicsSet
    );
{code}

I notice that when I stop or shut down the Kafka brokers, my Spark application also shuts down.

Here is the spark execution script
{code}
spark-submit \
--master yarn-cluster \
--files /home/siddiquf/spark/log4j-spark.xml \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-spark.xml" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-spark.xml" \
--class com.example.MyDataStreamProcessor \
myapp.jar 
{code}

The Spark job is submitted successfully and I can track the application driver and 
worker/executor nodes.

Everything works fine; my only concern is that if the Kafka brokers are offline or 
restarted, my application, which is controlled by YARN, should not shut down - but it does.

If this is expected behavior, then how can such a situation be handled with the least 
maintenance? Keep in mind that the Kafka cluster is not part of the Hadoop cluster and 
is managed by a different team, which is why our application needs to be resilient.

Thanks
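One common driver-side mitigation, sketched here in PySpark for brevity (broker list, topic, batch interval and retry delay are placeholders, and how the failure surfaces depends on the Spark/Kafka versions), is to supervise the streaming context and recreate it when it dies while the brokers are down:

{code}
import time
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def build_context():
    sc = SparkContext(appName="MyDataStreamProcessor")
    ssc = StreamingContext(sc, 10)
    stream = KafkaUtils.createDirectStream(
        ssc, ["Topic1"],
        {"metadata.broker.list": "broker1,broker2,broker3",
         "auto.offset.reset": "largest"})
    stream.foreachRDD(lambda rdd: rdd.count())   # real processing goes here
    return ssc

while True:
    ssc = build_context()
    ssc.start()
    try:
        ssc.awaitTermination()
    except Exception as exc:
        print("Streaming stopped (%s); retrying in 30s" % exc)
    ssc.stop(True, False)   # stop SparkContext too, no graceful shutdown
    time.sleep(30)
{code}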



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10574) HashingTF should use MurmurHash3

2016-04-19 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248627#comment-15248627
 ] 

Simeon Simeonov edited comment on SPARK-10574 at 4/19/16 9:05 PM:
--

[~josephkb] I agree that it would be an improvement. The issue I see with the 
current patch is that it would be an incompatible API change in the future 
(specifying hashing functions as objects and not by name). If we make just this 
one change everything else can be handled with no API changes, e.g., seeds are 
just constructor parameters or closure variables available to the hashing 
function and collision detection is just decoration. 

That's my practical argument related to MLlib. 

Beyond that, there are multiple arguments related to the usability, testability 
and maintainability of the Spark codebase, which has high code change velocity 
from a large number of committers, which contributes to a high issue rate. The 
simplest way to battle this is one design decision at a time. The PR hard-codes 
what is essentially a strategy pattern in the implementation of an object. It 
conflates responsibilities. It introduces branching, which makes testing and 
documentation more complicated. If hashing functions are externalized, they 
could be trivially tested. If {{HashingTF}} took a {{Function1[Any, Int]}} as 
input it could also be tested much more simply with any function. The behavior 
and the APIs become simpler to document because they do one thing. Etc. 

Perhaps I'm only seeing the benefits of externalizing the hashing strategy and 
missing the complexity in what I propose? We have ample examples of Spark APIs 
using functions as inputs so there are standard ways to handle this across 
languages. We don't need a custom trait if we stick to {{Any}} as the hashing 
function input. What else could be a problem?


was (Author: simeons):
[~josephkb] I agree that it would be an improvement. The issue I see with the 
current patch is that it would be an incompatible API change in the future 
(specifying hashing functions as objects and not by name). If we make just this 
one change everything else can be handled with no API changes, e.g., seeds are 
just constructor parameters or closure variables available to the hashing 
function and collision detection is just decoration. 

That's my practical argument related to MLlib. 

Beyond that, there are multiple arguments related to the usability, testability 
and maintainability of the Spark codebase, which has high code change velocity 
from a large number of committers, which contributes to a high issue rate. The 
simplest way to battle this is one design decision at a time. The PR hard-codes 
what is essentially a strategy pattern in the implementation of an object. It 
conflates responsibilities. It introduces branching, which makes testing and 
documentation more complicated. If hashing functions are externalized, they 
could be trivially tested. If {{HashingTF}} took a {{Function1[Any, Int]}} as 
input it could also be tested much more simply with any function. The 
documentation and the APIs become simpler to document because they do one 
thing. Etc. 

Perhaps I'm only seeing the benefits of externalizing the hashing strategy and 
missing the complexity in what I propose? We have ample examples of Spark APIs 
using functions as inputs so there are standard ways to handle this across 
languages. We don't need a custom trait if we stick to {{Any}} as the hashing 
function input. What else could be a problem?

> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Assignee: Yanbo Liang
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> 

[jira] [Commented] (SPARK-10574) HashingTF should use MurmurHash3

2016-04-19 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248627#comment-15248627
 ] 

Simeon Simeonov commented on SPARK-10574:
-

[~josephkb] I agree that it would be an improvement. The issue I see with the 
current patch is that it would be an incompatible API change in the future 
(specifying hashing functions as objects and not by name). If we make just this 
one change everything else can be handled with no API changes, e.g., seeds are 
just constructor parameters or closure variables available to the hashing 
function and collision detection is just decoration. 

That's my practical argument related to MLlib. 

Beyond that, there are multiple arguments related to the usability, testability 
and maintainability of the Spark codebase, which has high code change velocity 
from a large number of committers, which contributes to a high issue rate. The 
simplest way to battle this is one design decision at a time. The PR hard-codes 
what is essentially a strategy pattern in the implementation of an object. It 
conflates responsibilities. It introduces branching, which makes testing and 
documentation more complicated. If hashing functions are externalized, they 
could be trivially tested. If {{HashingTF}} took a {{Function1[Any, Int]}} as 
input it could also be tested much more simply with any function. The 
documentation and the APIs become simpler to document because they do one 
thing. Etc. 

Perhaps I'm only seeing the benefits of externalizing the hashing strategy and 
missing the complexity in what I propose? We have ample examples of Spark APIs 
using functions as inputs so there are standard ways to handle this across 
languages. We don't need a custom trait if we stick to {{Any}} as the hashing 
function input. What else could be a problem?

> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Assignee: Yanbo Liang
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
>  which is also included in the standard Scala libraries and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like strings 
> the hashing function choice will not be a big problem but why have an 
> implementation in MLlib with this limitation when there is a better 
> implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that 
> this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a 
> previous version would have to be re-trained. This introduces a problem 
> that's orthogonal to breaking changes in APIs: breaking changes related to 
> artifacts, e.g., a saved model, produced by a previous version. Is there a 
> policy or best practice currently in effect about this? If not, perhaps we 
> should come up with a few simple rules about how we communicate these in 
> release notes, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14736) Deadlock in registering applications while the Master is in the RECOVERING mode

2016-04-19 Thread niranda perera (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

niranda perera updated SPARK-14736:
---
Affects Version/s: 1.5.0
   1.6.0

> Deadlock in registering applications while the Master is in the RECOVERING 
> mode
> ---
>
> Key: SPARK-14736
> URL: https://issues.apache.org/jira/browse/SPARK-14736
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.5.0, 1.6.0
> Environment: unix, Spark cluster with a custom 
> StandaloneRecoveryModeFactory and a custom PersistenceEngine
>Reporter: niranda perera
>Priority: Critical
>
> I have encountered the following issue in the standalone recovery mode. 
> Let's say there was an application A running in the cluster. Due to some 
> issue, the entire cluster, together with the application A goes down. 
> Then later on, cluster comes back online, and the master then goes into the 
> 'recovering' mode, because it sees some apps, workers and drivers have 
> already been in the cluster from Persistence Engine. While in the recovery 
> process, the application comes back online, but now it would have a different 
> ID, let's say B. 
> But then, as per the master's application registration logic, this application 
> B will NOT be added to 'waitingApps', with the message "Attempted to 
> re-register application at same address". [1]
>   private def registerApplication(app: ApplicationInfo): Unit = {
>     val appAddress = app.driver.address
>     if (addressToApp.contains(appAddress)) {
>       logInfo("Attempted to re-register application at same address: " + appAddress)
>       return
>     }
> The problem here is that the master is trying to recover application A, which is 
> no longer there. Therefore, after the recovery process, app A will be dropped. 
> However, app A's successor, app B, was also omitted from the 'waitingApps' list 
> because it had the same address as app A previously. 
> This creates a deadlock in the cluster: neither app A nor app B is available in 
> the cluster. 
> When the master is in the RECOVERING mode, shouldn't it add all the registering 
> apps to a list first, and then, after the recovery is completed (once the 
> unsuccessful recoveries are removed), deploy the apps which are new?
> This would sort out this deadlock, IMO.
> [1] 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14736) Deadlock in registering applications while the Master is in the RECOVERING mode

2016-04-19 Thread niranda perera (JIRA)
niranda perera created SPARK-14736:
--

 Summary: Deadlock in registering applications while the Master is 
in the RECOVERING mode
 Key: SPARK-14736
 URL: https://issues.apache.org/jira/browse/SPARK-14736
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
 Environment: unix, Spark cluster with a custom 
StandaloneRecoveryModeFactory and a custom PersistenceEngine
Reporter: niranda perera
Priority: Critical


I have encountered the following issue in the standalone recovery mode. 

Let's say there was an application A running in the cluster. Due to some issue, 
the entire cluster, together with the application A goes down. 

Then later on, cluster comes back online, and the master then goes into the 
'recovering' mode, because it sees some apps, workers and drivers have already 
been in the cluster from Persistence Engine. While in the recovery process, the 
application comes back online, but now it would have a different ID, let's say 
B. 

But then, as per the master's application registration logic, this application B 
will NOT be added to 'waitingApps', with the message "Attempted to re-register 
application at same address". [1]

  private def registerApplication(app: ApplicationInfo): Unit = {
    val appAddress = app.driver.address
    if (addressToApp.contains(appAddress)) {
      logInfo("Attempted to re-register application at same address: " + appAddress)
      return
    }


The problem here is that the master is trying to recover application A, which is no 
longer there. Therefore, after the recovery process, app A will be dropped. However, 
app A's successor, app B, was also omitted from the 'waitingApps' list because it had 
the same address as app A previously. 

This creates a deadlock in the cluster: neither app A nor app B is available. 

When the master is in the RECOVERING mode, shouldn't it add all the registering apps 
to a list first, and then, after the recovery is completed (once the unsuccessful 
recoveries are removed), deploy the apps which are new?

This would sort out this deadlock, IMO.

[1] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14735) PySpark HashingTF hashAlgorithm param + docs

2016-04-19 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-14735:
-

 Summary: PySpark HashingTF hashAlgorithm param + docs
 Key: SPARK-14735
 URL: https://issues.apache.org/jira/browse/SPARK-14735
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib, PySpark
Reporter: Joseph K. Bradley


Add a hashAlgorithm param to HashingTF in PySpark, and update the docs to indicate 
that the default algorithm is MurmurHash3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10574) HashingTF should use MurmurHash3

2016-04-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10574:
--
Shepherd: Joseph K. Bradley

> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Assignee: Yanbo Liang
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
>  which is also included in the standard Scala libraries and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like strings 
> the hashing function choice will not be a big problem but why have an 
> implementation in MLlib with this limitation when there is a better 
> implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that 
> this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a 
> previous version would have to be re-trained. This introduces a problem 
> that's orthogonal to breaking changes in APIs: breaking changes related to 
> artifacts, e.g., a saved model, produced by a previous version. Is there a 
> policy or best practice currently in effect about this? If not, perhaps we 
> should come up with a few simple rules about how we communicate these in 
> release notes, etc.
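
A minimal illustration of the two choices (the index mapping below is a simplified 
stand-in for what {{HashingTF}} does, not its actual code):

{code:scala}
import scala.util.hashing.MurmurHash3

// Simplified stand-in for a term -> feature-index mapping, for illustration only.
object HashChoiceDemo {
  val numFeatures = 1 << 20

  private def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Scala's native ## -- the behavior the issue flags as platform specific.
  def nativeIndex(term: Any): Int = nonNegativeMod(term.##, numFeatures)

  // MurmurHash3 with a fixed seed -- reproducible for a given seed.
  def murmurIndex(term: Any, seed: Int = 42): Int =
    nonNegativeMod(MurmurHash3.stringHash(term.toString, seed), numFeatures)

  def main(args: Array[String]): Unit = {
    val term = "complex event processing"
    println(s"native : ${nativeIndex(term)}")
    println(s"murmur3: ${murmurIndex(term)}")
  }
}
{code}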



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14692) Error While Setting the path for R front end

2016-04-19 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248557#comment-15248557
 ] 

Felix Cheung commented on SPARK-14692:
--

Is Sys.getenv("SPARK_HOME") returning a valid path?

> Error While Setting the path for R front end
> 
>
> Key: SPARK-14692
> URL: https://issues.apache.org/jira/browse/SPARK-14692
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
> Environment: Mac OSX
>Reporter: Niranjan Molkeri`
>
> I'm trying to set the environment path for SparkR in RStudio and am hitting 
> this error. 
> > .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> > library(SparkR)
> Error in library(SparkR) : there is no package called ‘SparkR’
> > sc <- sparkR.init(master="local")
> Error: could not find function "sparkR.init"
> In the directory it points to, there is a directory called SparkR. I don't 
> know how to proceed with this.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13448) Document MLlib behavior changes in Spark 2.0

2016-04-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-13448:
--
Description: 
This JIRA keeps a list of MLlib behavior changes in Spark 2.0, so we can 
remember to add them to the migration guide / release notes.

* SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 
to 1e-6.
* SPARK-7780: The intercept will not be regularized when users train a binary 
classification model with an L1/L2 Updater via LogisticRegressionWithLBFGS, 
because it now calls the ML LogisticRegression implementation. If users train 
without regularization, training with or without feature scaling returns the 
same solution at the same convergence rate (because both follow the same code 
path); this behavior differs from the old API.
* SPARK-12363: Bug fix for PowerIterationClustering which will likely change 
results
* SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by 
default, if checkpointing is being used.
* SPARK-12153: Word2Vec now respects sentence boundaries.  Previously, it did 
not handle them correctly.
* SPARK-10574: HashingTF uses MurmurHash3 by default in both spark.ml and 
spark.mllib

  was:
This JIRA keeps a list of MLlib behavior changes in Spark 2.0, so we can 
remember to add them to the migration guide / release notes.

* SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 
to 1e-6.
* SPARK-7780: The intercept will not be regularized when users train a binary 
classification model with an L1/L2 Updater via LogisticRegressionWithLBFGS, 
because it now calls the ML LogisticRegression implementation. If users train 
without regularization, training with or without feature scaling returns the 
same solution at the same convergence rate (because both follow the same code 
path); this behavior differs from the old API.
* SPARK-12363: Bug fix for PowerIterationClustering which will likely change 
results
* SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by 
default, if checkpointing is being used.
* SPARK-12153: Word2Vec now respects sentence boundaries.  Previously, it did 
not handle them correctly.
* SPARK-10574: HashingTF uses MurmurHash3 by default


> Document MLlib behavior changes in Spark 2.0
> 
>
> Key: SPARK-13448
> URL: https://issues.apache.org/jira/browse/SPARK-13448
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This JIRA keeps a list of MLlib behavior changes in Spark 2.0, so we can 
> remember to add them to the migration guide / release notes.
> * SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 
> to 1e-6 (see the snippet after this list for keeping the old tolerance).
> * SPARK-7780: The intercept will not be regularized when users train a binary 
> classification model with an L1/L2 Updater via LogisticRegressionWithLBFGS, 
> because it now calls the ML LogisticRegression implementation. If users train 
> without regularization, training with or without feature scaling returns the 
> same solution at the same convergence rate (because both follow the same code 
> path); this behavior differs from the old API.
> * SPARK-12363: Bug fix for PowerIterationClustering which will likely change 
> results
> * SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by 
> default, if checkpointing is being used.
> * SPARK-12153: Word2Vec now respects sentence boundaries.  Previously, it did 
> not handle them correctly.
> * SPARK-10574: HashingTF uses MurmurHash3 by default in both spark.ml and 
> spark.mllib
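
For the SPARK-13429 item above, a spark-shell style sketch of how users who depend 
on the old tolerance could set it back explicitly (assuming the existing spark.mllib 
optimizer setter; the training call is left commented out):

{code:scala}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

val lr = new LogisticRegressionWithLBFGS()
// Spark 2.0 changes the default convergence tolerance from 1e-4 to 1e-6;
// restore the old value explicitly if the previous behavior is needed.
lr.optimizer.setConvergenceTol(1e-4)
// val model = lr.run(trainingData)  // trainingData: RDD[LabeledPoint]
{code}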



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13448) Document MLlib behavior changes in Spark 2.0

2016-04-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-13448:
--
Description: 
This JIRA keeps a list of MLlib behavior changes in Spark 2.0, so we can 
remember to add them to the migration guide / release notes.

* SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 
to 1e-6.
* SPARK-7780: The intercept will not be regularized when users train a binary 
classification model with an L1/L2 Updater via LogisticRegressionWithLBFGS, 
because it now calls the ML LogisticRegression implementation. If users train 
without regularization, training with or without feature scaling returns the 
same solution at the same convergence rate (because both follow the same code 
path); this behavior differs from the old API.
* SPARK-12363: Bug fix for PowerIterationClustering which will likely change 
results
* SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by 
default, if checkpointing is being used.
* SPARK-12153: Word2Vec now respects sentence boundaries.  Previously, it did 
not handle them correctly.
* SPARK-10574: HashingTF uses MurmurHash3 by default

  was:
This JIRA keeps a list of MLlib behavior changes in Spark 2.0, so we can 
remember to add them to the migration guide / release notes.

* SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 
to 1e-6.
* SPARK-7780: The intercept will not be regularized when users train a binary 
classification model with an L1/L2 Updater via LogisticRegressionWithLBFGS, 
because it now calls the ML LogisticRegression implementation. If users train 
without regularization, training with or without feature scaling returns the 
same solution at the same convergence rate (because both follow the same code 
path); this behavior differs from the old API.
* SPARK-12363: Bug fix for PowerIterationClustering which will likely change 
results
* SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by 
default, if checkpointing is being used.
* SPARK-12153: Word2Vec now respects sentence boundaries.  Previously, it did 
not handle them correctly.


> Document MLlib behavior changes in Spark 2.0
> 
>
> Key: SPARK-13448
> URL: https://issues.apache.org/jira/browse/SPARK-13448
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This JIRA keeps a list of MLlib behavior changes in Spark 2.0, so we can 
> remember to add them to the migration guide / release notes.
> * SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 
> to 1e-6.
> * SPARK-7780: The intercept will not be regularized when users train a binary 
> classification model with an L1/L2 Updater via LogisticRegressionWithLBFGS, 
> because it now calls the ML LogisticRegression implementation. If users train 
> without regularization, training with or without feature scaling returns the 
> same solution at the same convergence rate (because both follow the same code 
> path); this behavior differs from the old API.
> * SPARK-12363: Bug fix for PowerIterationClustering which will likely change 
> results
> * SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by 
> default, if checkpointing is being used.
> * SPARK-12153: Word2Vec now respects sentence boundaries.  Previously, it did 
> not handle them correctly.
> * SPARK-10574: HashingTF uses MurmurHash3 by default



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10574) HashingTF should use MurmurHash3

2016-04-19 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248514#comment-15248514
 ] 

Joseph K. Bradley commented on SPARK-10574:
---

Copying [~simeons]'s comments from the PR:
{quote}
When the "hashing trick" is used in practice, it is important to do things such 
as monitor, manage or randomize collisions. If there are problems, it is not 
uncommon to vary the hashing function. All this suggests that a hashing 
function should be treated as an object with a simple interface, perhaps as 
simple as Function1[Any, Int]. Collision monitoring can then be performed with 
a decorator with an accumulator. Collision management would be performed by 
varying the seed or adding salt. Collision randomization would be performed by 
varying the seed/salt with each run and/or running multiple models in 
production which are identical except for the different seed/salt used.

The hashing trick is very important in ML and quite... tricky... to get working 
well for complex, high-dimension spaces, which Spark is perfect for. An 
implementation that does not treat the hashing function as a first class object 
would substantially hinder MLlib's capabilities in practice.
{quote}
--> This initial PR should be a big improvement, even if we just use 
MurmurHash3 without varied seed/salts like you're suggesting.  This also seems 
acceptable for now since it's what scikit-learn does.  But later PRs could add 
further improvements.
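
A rough sketch of the kind of interface being described, with a plain counter 
standing in for a Spark accumulator (all names here are made up for illustration, 
not proposed API):

{code:scala}
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// A hashing function treated as a first-class object...
trait TermHasher extends (Any => Int)

class Murmur3Hasher(seed: Int) extends TermHasher {
  def apply(term: Any): Int = MurmurHash3.stringHash(term.toString, seed)
}

// ...plus a decorator that monitors collisions. In a real Spark job the
// counter could be an accumulator instead of a local var.
class CollisionMonitoringHasher(underlying: TermHasher, numFeatures: Int) extends TermHasher {
  private val bucketContents = mutable.Map.empty[Int, mutable.Set[Any]]
  var collisions: Long = 0L

  def apply(term: Any): Int = {
    val idx = ((underlying(term) % numFeatures) + numFeatures) % numFeatures
    val bucket = bucketContents.getOrElseUpdate(idx, mutable.Set.empty[Any])
    if (bucket.nonEmpty && !bucket.contains(term)) collisions += 1
    bucket += term
    idx
  }
}
{code}

Collision management or randomization would then amount to constructing the 
underlying hasher with a different seed or salt per run.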

> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Assignee: Yanbo Liang
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
>  which is also included in the standard Scala libraries and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like strings 
> the hashing function choice will not be a big problem but why have an 
> implementation in MLlib with this limitation when there is a better 
> implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that 
> this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a 
> previous version would have to be re-trained. This introduces a problem 
> that's orthogonal to breaking changes in APIs: breaking changes related to 
> artifacts, e.g., a saved model, produced by a previous version. Is there a 
> policy or best practice currently in effect about this? If not, perhaps we 
> should come up with a few simple rules about how we communicate these in 
> release notes, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


