[jira] [Assigned] (SPARK-14745) CEP support in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14745:
------------------------------------

    Assignee:  (was: Apache Spark)

> CEP support in Spark Streaming
> ------------------------------
>
>                 Key: SPARK-14745
>                 URL: https://issues.apache.org/jira/browse/SPARK-14745
>             Project: Spark
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: Mario Briggs
>         Attachments: SparkStreamingCEP.pdf
>
> Complex Event Processing is an often-used feature in streaming applications.
> Spark Streaming currently does not have a DSL/API for it. This JIRA is about
> how and what we can add in Spark Streaming to support CEP out of the box.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14745) CEP support in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14745:
------------------------------------

    Assignee: Apache Spark

> CEP support in Spark Streaming
> ------------------------------
>
>                 Key: SPARK-14745
>                 URL: https://issues.apache.org/jira/browse/SPARK-14745
>             Project: Spark
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: Mario Briggs
>            Assignee: Apache Spark
>         Attachments: SparkStreamingCEP.pdf
>
> Complex Event Processing is an often-used feature in streaming applications.
> Spark Streaming currently does not have a DSL/API for it. This JIRA is about
> how and what we can add in Spark Streaming to support CEP out of the box.
[jira] [Commented] (SPARK-14745) CEP support in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249289#comment-15249289 ]

Apache Spark commented on SPARK-14745:
--------------------------------------

User 'agsachin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12518

> CEP support in Spark Streaming
> ------------------------------
>
>                 Key: SPARK-14745
>                 URL: https://issues.apache.org/jira/browse/SPARK-14745
>             Project: Spark
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: Mario Briggs
>         Attachments: SparkStreamingCEP.pdf
>
> Complex Event Processing is an often-used feature in streaming applications.
> Spark Streaming currently does not have a DSL/API for it. This JIRA is about
> how and what we can add in Spark Streaming to support CEP out of the box.
[jira] [Resolved] (SPARK-14639) Add `bround` function in Python/R.
[ https://issues.apache.org/jira/browse/SPARK-14639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-14639.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

Issue resolved by pull request 12509
[https://github.com/apache/spark/pull/12509]

> Add `bround` function in Python/R.
> ----------------------------------
>
>                 Key: SPARK-14639
>                 URL: https://issues.apache.org/jira/browse/SPARK-14639
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SparkR
>            Reporter: Dongjoon Hyun
>             Fix For: 2.0.0
>
> This issue aims to expose the Scala `bround` function in the Python/R API.
> The `bround` function was implemented in SPARK-14614 by extending the
> current `round` function. We used the following semantics from
> [Hive|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/RoundUtils.java]:
> {code}
> public static double bround(double input, int scale) {
>   if (Double.isNaN(input) || Double.isInfinite(input)) {
>     return input;
>   }
>   return BigDecimal.valueOf(input).setScale(scale,
>       RoundingMode.HALF_EVEN).doubleValue();
> }
> {code}
[jira] [Updated] (SPARK-14639) Add `bround` function in Python/R.
[ https://issues.apache.org/jira/browse/SPARK-14639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-14639:
-------------------------------
    Assignee: Dongjoon Hyun

> Add `bround` function in Python/R.
> ----------------------------------
>
>                 Key: SPARK-14639
>                 URL: https://issues.apache.org/jira/browse/SPARK-14639
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SparkR
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>             Fix For: 2.0.0
>
> This issue aims to expose the Scala `bround` function in the Python/R API.
> The `bround` function was implemented in SPARK-14614 by extending the
> current `round` function. We used the following semantics from
> [Hive|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/RoundUtils.java]:
> {code}
> public static double bround(double input, int scale) {
>   if (Double.isNaN(input) || Double.isInfinite(input)) {
>     return input;
>   }
>   return BigDecimal.valueOf(input).setScale(scale,
>       RoundingMode.HALF_EVEN).doubleValue();
> }
> {code}
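The HALF_EVEN ("banker's rounding") semantics quoted from Hive above can be sketched in plain Python with the standard `decimal` module. This is an illustrative re-implementation to show the behavior being exposed, not the actual PySpark `bround` code:

```python
import math
from decimal import Decimal, ROUND_HALF_EVEN

def bround(value: float, scale: int) -> float:
    """Round half to even at `scale` decimal places, mirroring the
    Hive semantics quoted above (NaN/Inf pass through unchanged)."""
    if math.isnan(value) or math.isinf(value):
        return value
    # BigDecimal.valueOf(double) uses the double's string form; repr()
    # plays the analogous role here.
    quantum = Decimal(1).scaleb(-scale)  # scale=0 -> 1, scale=2 -> 0.01
    return float(Decimal(repr(value)).quantize(quantum, rounding=ROUND_HALF_EVEN))
```

Unlike Python's built-in `round` on floats, going through `Decimal` avoids binary-representation surprises on exact halves: `bround(2.5, 0)` gives `2.0` while `bround(3.5, 0)` gives `4.0`, since ties go to the even neighbor.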
[jira] [Comment Edited] (SPARK-14745) CEP support in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249261#comment-15249261 ]

Mario Briggs edited comment on SPARK-14745 at 4/20/16 5:18 AM:
---------------------------------------------------------------

Document describing what CEP is, with examples, features, and a possible API

was (Author: mariobriggs):
Examples, features, and a possible API

> CEP support in Spark Streaming
> ------------------------------
>
>                 Key: SPARK-14745
>                 URL: https://issues.apache.org/jira/browse/SPARK-14745
>             Project: Spark
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: Mario Briggs
>         Attachments: SparkStreamingCEP.pdf
>
> Complex Event Processing is an often-used feature in streaming applications.
> Spark Streaming currently does not have a DSL/API for it. This JIRA is about
> how and what we can add in Spark Streaming to support CEP out of the box.
[jira] [Updated] (SPARK-14745) CEP support in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mario Briggs updated SPARK-14745:
---------------------------------
    Attachment: SparkStreamingCEP.pdf

Examples, features, and a possible API

> CEP support in Spark Streaming
> ------------------------------
>
>                 Key: SPARK-14745
>                 URL: https://issues.apache.org/jira/browse/SPARK-14745
>             Project: Spark
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: Mario Briggs
>         Attachments: SparkStreamingCEP.pdf
>
> Complex Event Processing is an often-used feature in streaming applications.
> Spark Streaming currently does not have a DSL/API for it. This JIRA is about
> how and what we can add in Spark Streaming to support CEP out of the box.
[jira] [Created] (SPARK-14745) CEP support in Spark Streaming
Mario Briggs created SPARK-14745:
---------------------------------

             Summary: CEP support in Spark Streaming
                 Key: SPARK-14745
                 URL: https://issues.apache.org/jira/browse/SPARK-14745
             Project: Spark
          Issue Type: New Feature
          Components: Streaming
            Reporter: Mario Briggs


Complex Event Processing is an often-used feature in streaming applications.
Spark Streaming currently does not have a DSL/API for it. This JIRA is about
how and what we can add in Spark Streaming to support CEP out of the box.
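As a rough illustration of the kind of operator a CEP DSL typically provides (a purely hypothetical sketch, not any API proposed in the attachment), consider a "followed-by" pattern that matches an event `first` followed by an event `second` within a bounded window:

```python
from collections import deque

def match_followed_by(events, first, second, window=3):
    """Yield (i, j) index pairs where event `first` at position i is
    followed by event `second` at position j, with j - i <= window.
    Illustrative CEP 'followed-by' semantics; not a Spark Streaming API."""
    pending = deque()  # positions of `first` events still inside the window
    for j, ev in enumerate(events):
        # evict `first` occurrences that have fallen out of the window
        while pending and j - pending[0] > window:
            pending.popleft()
        if ev == second:
            for i in list(pending):
                yield (i, j)
        if ev == first:
            pending.append(j)
```

For example, over the sequence `["A", "x", "B", "A", "B"]` with `window=3`, the pattern A-followed-by-B matches at index pairs `(0, 2)` and `(3, 4)`; the A at index 0 is evicted before the second B because it is more than 3 events away.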
[jira] [Resolved] (SPARK-14600) Push predicates through Expand
[ https://issues.apache.org/jira/browse/SPARK-14600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-14600.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

Issue resolved by pull request 12496
[https://github.com/apache/spark/pull/12496]

> Push predicates through Expand
> ------------------------------
>
>                 Key: SPARK-14600
>                 URL: https://issues.apache.org/jira/browse/SPARK-14600
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Davies Liu
>             Fix For: 2.0.0
>
> A grouping set is analyzed as Aggregate(Expand(Project)). The grouping
> attributes come from Project, but they have a different meaning in Project
> (equal to the original grouping expression) than in Expand (could be the
> original grouping expression or null). This does not make sense, because the
> attribute yields different results in different operators.
> A better way could be Aggregate(Expand()), but then we need to fix SQL
> generation.
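The "original grouping expression or null" behavior of Expand that the issue describes can be modeled in a few lines of plain Python (an illustrative model, not Spark's Catalyst code): each input row is replicated once per grouping set, and grouping columns outside that set are replaced with null.

```python
def expand(rows, grouping_sets, group_cols):
    """Model of the Expand operator for grouping sets: replicate each row
    once per grouping set, nulling out grouping columns not in that set.
    This is why a grouping attribute can be 'the original expression or
    null' after Expand. Illustrative only; not Spark code."""
    out = []
    for row in rows:
        for gs in grouping_sets:
            new = dict(row)
            for col in group_cols:
                if col not in gs:
                    new[col] = None  # null for this grouping set
            out.append(new)
    return out
```

Running it on one row with grouping sets `(a)` and `(a, b)` yields two rows, one with `b` nulled, which is the duplication that makes predicate pushdown through Expand subtle.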
[jira] [Updated] (SPARK-14704) create accumulators in TaskMetrics
[ https://issues.apache.org/jira/browse/SPARK-14704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-14704:
--------------------------------
    Issue Type: Sub-task  (was: Improvement)
        Parent: SPARK-14626

> create accumulators in TaskMetrics
> ----------------------------------
>
>                 Key: SPARK-14704
>                 URL: https://issues.apache.org/jira/browse/SPARK-14704
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>             Fix For: 2.0.0
>
[jira] [Resolved] (SPARK-14704) create accumulators in TaskMetrics
[ https://issues.apache.org/jira/browse/SPARK-14704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-14704.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

> create accumulators in TaskMetrics
> ----------------------------------
>
>                 Key: SPARK-14704
>                 URL: https://issues.apache.org/jira/browse/SPARK-14704
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>             Fix For: 2.0.0
>
[jira] [Resolved] (SPARK-13419) SubquerySuite should use checkAnswer rather than ScalaTest's assertResult
[ https://issues.apache.org/jira/browse/SPARK-13419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-13419.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

Issue resolved by pull request 12269
[https://github.com/apache/spark/pull/12269]

> SubquerySuite should use checkAnswer rather than ScalaTest's assertResult
> -------------------------------------------------------------------------
>
>                 Key: SPARK-13419
>                 URL: https://issues.apache.org/jira/browse/SPARK-13419
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>             Fix For: 2.0.0
>
> This is blocked by being able to generate SQL for subqueries.
[jira] [Created] (SPARK-14744) Put examples packaging on a diet
Marcelo Vanzin created SPARK-14744:
-----------------------------------

             Summary: Put examples packaging on a diet
                 Key: SPARK-14744
                 URL: https://issues.apache.org/jira/browse/SPARK-14744
             Project: Spark
          Issue Type: Improvement
          Components: Examples
    Affects Versions: 2.0.0
            Reporter: Marcelo Vanzin
            Priority: Minor


Currently the examples bring in a lot of external dependencies, ballooning the
size of the Spark distribution packages. I'd like to propose two things to slim
down these dependencies:

- make all non-Spark, and also Spark Streaming, dependencies "provided". This
  means, especially for streaming connectors, that launching examples becomes
  more like launching real applications (where you need to figure out how to
  provide those dependencies, e.g. using {{--packages}}).
- audit examples and remove those that don't provide a lot of value. For
  example, HBase is working on full-featured Spark bindings, based on code that
  has already been in use for a while before being merged into HBase. The HBase
  example in Spark is very bare bones and, in comparison, not really useful and
  in fact a little misleading.
[jira] [Created] (SPARK-14743) Improve delegation token handling in secure clusters
Marcelo Vanzin created SPARK-14743:
-----------------------------------

             Summary: Improve delegation token handling in secure clusters
                 Key: SPARK-14743
                 URL: https://issues.apache.org/jira/browse/SPARK-14743
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.0.0
            Reporter: Marcelo Vanzin


In a way, I'd consider this a parent bug of SPARK-7252.

Spark's current support for delegation tokens is a little all over the place:

- for HDFS, there's support for re-creating tokens if a principal and keytab
  are provided
- for HBase and Hive, Spark will fetch delegation tokens so that apps can work
  in cluster mode, but will not re-create them, so apps that need those will
  stop working after 7 days
- for anything else, Spark doesn't do anything

Lots of other services use delegation tokens, and supporting them as data
sources in Spark becomes more complicated because of that; e.g., Kafka will
(hopefully) soon support them.

It would be nice if Spark had consistent support for handling delegation
tokens regardless of who needs them. I'd list these as the requirements:

- Spark should provide a generic interface for fetching delegation tokens. This
  would allow Spark's delegation token support to be extended using some plugin
  architecture (e.g. Java services), meaning Spark itself doesn't need to
  support every possible service out there. It would be used to fetch tokens
  when launching apps in cluster mode, and when a principal and a keytab are
  provided to Spark.
- A way to manually update delegation tokens in Spark. For example, a new
  SparkContext API, or some configuration that tells Spark to monitor a file
  for changes and load tokens from said file. This would allow external
  applications to manage tokens outside of Spark and update a running Spark
  application (think, for example, a job server like Oozie, or something like
  Hive-on-Spark which manages Spark apps running remotely).
- A way to notify running code that new delegation tokens have been loaded.
  This may not be strictly necessary; it might be possible for code to detect
  that, e.g., by peeking into the UserGroupInformation structure. But an event
  sent to the listener bus would allow applications to react when new tokens
  are available (e.g., the Hive backend could re-create connections to the
  metastore server using the new tokens).

Also, cc'ing [~busbey] and [~steve_l] since you've talked about this on the
mailing list recently.
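The first requirement above — a generic, pluggable interface for fetching tokens — can be sketched abstractly. Everything here is hypothetical naming for illustration (Spark's actual design would be in Scala/Java and discovered via Java services); the point is that the launcher only talks to the interface, so new services need no core changes:

```python
from typing import Dict, List

class DelegationTokenProvider:
    """Hypothetical plugin interface: one implementation per external
    service (HDFS, Hive, HBase, Kafka, ...)."""
    service_name = "base"

    def tokens_required(self, conf: Dict[str, str]) -> bool:
        raise NotImplementedError

    def obtain_tokens(self, conf: Dict[str, str]) -> List[str]:
        raise NotImplementedError

class HDFSTokenProvider(DelegationTokenProvider):
    """Toy provider standing in for a real HDFS implementation."""
    service_name = "hdfs"

    def tokens_required(self, conf):
        return conf.get("security") == "kerberos"

    def obtain_tokens(self, conf):
        return ["hdfs-delegation-token"]  # placeholder for a real token

def fetch_all_tokens(providers, conf):
    """What a launcher would do in cluster mode: ask every discovered
    plugin whether it needs tokens, and collect what each one fetches."""
    tokens = []
    for provider in providers:
        if provider.tokens_required(conf):
            tokens.extend(provider.obtain_tokens(conf))
    return tokens
```

A file-watching token refresher or a SparkContext update API (the second requirement) would then feed new tokens through the same collection path.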
[jira] [Updated] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS
[ https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated SPARK-14521:
-------------------------------------
    Summary: StackOverflowError in Kryo when executing TPC-DS  (was:
StackOverflowError in Kryo when executing TPC-DS Query27)

> StackOverflowError in Kryo when executing TPC-DS
> ------------------------------------------------
>
>                 Key: SPARK-14521
>                 URL: https://issues.apache.org/jira/browse/SPARK-14521
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Rajesh Balamohan
>            Priority: Blocker
>
> Build details: Spark build from master branch (Apr-10)
> DataSet: TPC-DS at 200 GB scale in Parquet format stored in Hive.
> Client: $SPARK_HOME/bin/beeline
> Query: TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyway)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
>   at com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
>   at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
>   at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
>   at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}
[jira] [Resolved] (SPARK-13905) Change signature of as.data.frame() to be consistent with the R base package
[ https://issues.apache.org/jira/browse/SPARK-13905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman resolved SPARK-13905.
-------------------------------------------
       Resolution: Fixed
         Assignee: Sun Rui
    Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/11811

> Change signature of as.data.frame() to be consistent with the R base package
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-13905
>                 URL: https://issues.apache.org/jira/browse/SPARK-13905
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>    Affects Versions: 1.6.1
>            Reporter: Sun Rui
>            Assignee: Sun Rui
>             Fix For: 2.0.0
>
> Change the signature of as.data.frame() to be consistent with that in the R
> base package, matching R users' expectations, as documented at
> http://www.inside-r.org/r-doc/base/as.data.frame
[jira] [Closed] (SPARK-8327) Ganglia failed to start while starting standalone on EC 2 spark with spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-8327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vladimir Vladimirov closed SPARK-8327.
--------------------------------------
    Resolution: Not A Problem

> Ganglia failed to start while starting standalone on EC 2 spark with
> spark-ec2
> ---------------------------------------------------------------------
>
>                 Key: SPARK-8327
>                 URL: https://issues.apache.org/jira/browse/SPARK-8327
>             Project: Spark
>          Issue Type: Bug
>          Components: EC2
>    Affects Versions: 1.3.1
>            Reporter: Vladimir Vladimirov
>            Priority: Minor
>
> Exception shown:
> {code}
> [FAILED] Starting httpd: httpd: Syntax error on line 199 of
> /etc/httpd/conf/httpd.conf: Cannot load modules/libphp-5.5.so into server:
> /etc/httpd/modules/libphp-5.5.so: cannot open shared object file: No such
> file or directory [FAILED]
> {code}
[jira] [Created] (SPARK-14742) Redirect spark-ec2 doc to new location
Nicholas Chammas created SPARK-14742:
-------------------------------------

             Summary: Redirect spark-ec2 doc to new location
                 Key: SPARK-14742
                 URL: https://issues.apache.org/jira/browse/SPARK-14742
             Project: Spark
          Issue Type: Documentation
          Components: Documentation, EC2
            Reporter: Nicholas Chammas
            Priority: Minor


See: https://github.com/amplab/spark-ec2/pull/24#issuecomment-212033453

We need to redirect this page
http://spark.apache.org/docs/latest/ec2-scripts.html
to this page
https://github.com/amplab/spark-ec2#readme
[jira] [Commented] (SPARK-8327) Ganglia failed to start while starting standalone on EC 2 spark with spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-8327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249144#comment-15249144 ]

Nicholas Chammas commented on SPARK-8327:
-----------------------------------------

[~vvladymyrov] - Is this still an issue? If so, I suggest migrating this issue
to the spark-ec2 tracker: https://github.com/amplab/spark-ec2/issues

> Ganglia failed to start while starting standalone on EC 2 spark with
> spark-ec2
> ---------------------------------------------------------------------
>
>                 Key: SPARK-8327
>                 URL: https://issues.apache.org/jira/browse/SPARK-8327
>             Project: Spark
>          Issue Type: Bug
>          Components: EC2
>    Affects Versions: 1.3.1
>            Reporter: Vladimir Vladimirov
>            Priority: Minor
>
> Exception shown:
> {code}
> [FAILED] Starting httpd: httpd: Syntax error on line 199 of
> /etc/httpd/conf/httpd.conf: Cannot load modules/libphp-5.5.so into server:
> /etc/httpd/modules/libphp-5.5.so: cannot open shared object file: No such
> file or directory [FAILED]
> {code}
[jira] [Commented] (SPARK-6527) sc.binaryFiles can not access files on s3
[ https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249141#comment-15249141 ]

Nicholas Chammas commented on SPARK-6527:
-----------------------------------------

Did the s3a suggestion work? If not, did anybody file an issue as Steve
suggested with more detail?

> sc.binaryFiles can not access files on s3
> -----------------------------------------
>
>                 Key: SPARK-6527
>                 URL: https://issues.apache.org/jira/browse/SPARK-6527
>             Project: Spark
>          Issue Type: Bug
>          Components: EC2, Input/Output
>    Affects Versions: 1.2.0, 1.3.0
>        Environment: I am running Spark on EC2
>            Reporter: Zhao Zhang
>            Priority: Minor
>
> sc.binaryFiles() cannot access files stored on S3. It can correctly list the
> number of files, but reports "file does not exist" when processing them. I
> also tried sc.textFile(), which works fine.
[jira] [Comment Edited] (SPARK-6527) sc.binaryFiles can not access files on s3
[ https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249141#comment-15249141 ]

Nicholas Chammas edited comment on SPARK-6527 at 4/20/16 2:27 AM:
------------------------------------------------------------------

Did the s3a suggestion work? If not, did anybody file an issue as Steve
suggested?

was (Author: nchammas):
Did the s3a suggestion work? If not, did anybody file an issue as Steve
suggested with more detail?

> sc.binaryFiles can not access files on s3
> -----------------------------------------
>
>                 Key: SPARK-6527
>                 URL: https://issues.apache.org/jira/browse/SPARK-6527
>             Project: Spark
>          Issue Type: Bug
>          Components: EC2, Input/Output
>    Affects Versions: 1.2.0, 1.3.0
>        Environment: I am running Spark on EC2
>            Reporter: Zhao Zhang
>            Priority: Minor
>
> sc.binaryFiles() cannot access files stored on S3. It can correctly list the
> number of files, but reports "file does not exist" when processing them. I
> also tried sc.textFile(), which works fine.
[jira] [Assigned] (SPARK-14741) Streaming from partitioned directory structure captures unintended partition columns
[ https://issues.apache.org/jira/browse/SPARK-14741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14741:
------------------------------------

    Assignee: Tathagata Das  (was: Apache Spark)

> Streaming from partitioned directory structure captures unintended partition
> columns
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-14741
>                 URL: https://issues.apache.org/jira/browse/SPARK-14741
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Tathagata Das
>            Assignee: Tathagata Das
>
> Consider the following directory structure
> dir/col=X/some-files
> If we create a text format streaming dataframe on {{dir/col=X/}}, then it
> should not consider col as a partitioning column. Even though the streaming
> dataframe does not do so, the generated batch dataframes pick up col as a
> partitioning column, causing a mismatch between the streaming source schema
> and the generated df schema. This leads to a runtime failure:
> 18:55:11.262 ERROR org.apache.spark.sql.execution.streaming.StreamExecution:
> Query query-0 terminated with error
> java.lang.AssertionError: assertion failed: Invalid batch: c#2 != c#7,type#8
[jira] [Assigned] (SPARK-14741) Streaming from partitioned directory structure captures unintended partition columns
[ https://issues.apache.org/jira/browse/SPARK-14741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14741:
------------------------------------

    Assignee: Apache Spark  (was: Tathagata Das)

> Streaming from partitioned directory structure captures unintended partition
> columns
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-14741
>                 URL: https://issues.apache.org/jira/browse/SPARK-14741
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Tathagata Das
>            Assignee: Apache Spark
>
> Consider the following directory structure
> dir/col=X/some-files
> If we create a text format streaming dataframe on {{dir/col=X/}}, then it
> should not consider col as a partitioning column. Even though the streaming
> dataframe does not do so, the generated batch dataframes pick up col as a
> partitioning column, causing a mismatch between the streaming source schema
> and the generated df schema. This leads to a runtime failure:
> 18:55:11.262 ERROR org.apache.spark.sql.execution.streaming.StreamExecution:
> Query query-0 terminated with error
> java.lang.AssertionError: assertion failed: Invalid batch: c#2 != c#7,type#8
[jira] [Commented] (SPARK-14741) Streaming from partitioned directory structure captures unintended partition columns
[ https://issues.apache.org/jira/browse/SPARK-14741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249139#comment-15249139 ]

Apache Spark commented on SPARK-14741:
--------------------------------------

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/12517

> Streaming from partitioned directory structure captures unintended partition
> columns
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-14741
>                 URL: https://issues.apache.org/jira/browse/SPARK-14741
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Tathagata Das
>            Assignee: Tathagata Das
>
> Consider the following directory structure
> dir/col=X/some-files
> If we create a text format streaming dataframe on {{dir/col=X/}}, then it
> should not consider col as a partitioning column. Even though the streaming
> dataframe does not do so, the generated batch dataframes pick up col as a
> partitioning column, causing a mismatch between the streaming source schema
> and the generated df schema. This leads to a runtime failure:
> 18:55:11.262 ERROR org.apache.spark.sql.execution.streaming.StreamExecution:
> Query query-0 terminated with error
> java.lang.AssertionError: assertion failed: Invalid batch: c#2 != c#7,type#8
[jira] [Created] (SPARK-14741) Streaming from partitioned directory structure captures unintended partition columns
Tathagata Das created SPARK-14741:
----------------------------------

             Summary: Streaming from partitioned directory structure captures
unintended partition columns
                 Key: SPARK-14741
                 URL: https://issues.apache.org/jira/browse/SPARK-14741
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Tathagata Das
            Assignee: Tathagata Das


Consider the following directory structure

dir/col=X/some-files

If we create a text format streaming dataframe on {{dir/col=X/}}, then it
should not consider col as a partitioning column. Even though the streaming
dataframe does not do so, the generated batch dataframes pick up col as a
partitioning column, causing a mismatch between the streaming source schema
and the generated df schema. This leads to a runtime failure:

18:55:11.262 ERROR org.apache.spark.sql.execution.streaming.StreamExecution:
Query query-0 terminated with error
java.lang.AssertionError: assertion failed: Invalid batch: c#2 != c#7,type#8
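The mismatch comes down to where partition discovery starts scanning. A standalone sketch of Hive-style `key=value` path inference (a hypothetical helper, not Spark's actual implementation) shows why the base path matters: scanning relative to the user-supplied base path `dir/col=X` should yield no partition columns, while scanning from `dir` wrongly picks up `col`.

```python
import posixpath

def partition_columns(file_path: str, base_path: str) -> dict:
    """Infer Hive-style partition columns (k=v directory segments) from
    the part of `file_path` below `base_path`. Illustrative sketch of
    the inference this issue is about; not Spark code."""
    rel = posixpath.relpath(file_path, base_path)
    cols = {}
    for segment in rel.split("/")[:-1]:  # directories only, skip file name
        if "=" in segment:
            key, value = segment.split("=", 1)
            cols[key] = value
    return cols
```

With the base path set to `dir/col=X` (what the streaming source uses), nothing below it looks like a partition directory; with the base path set to `dir` (what the generated batch dataframes effectively did), `col=X` is treated as a partition, producing the schema mismatch in the assertion error above.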
[jira] [Commented] (SPARK-12148) SparkR: rename DataFrame to SparkDataFrame
[ https://issues.apache.org/jira/browse/SPARK-12148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249109#comment-15249109 ]

Sun Rui commented on SPARK-12148:
---------------------------------

[~felixcheung] Go ahead with this :)

> SparkR: rename DataFrame to SparkDataFrame
> ------------------------------------------
>
>                 Key: SPARK-12148
>                 URL: https://issues.apache.org/jira/browse/SPARK-12148
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>            Reporter: Michael Lawrence
>
> The SparkR package represents a Spark DataFrame with the class "DataFrame".
> That conflicts with the more general DataFrame class defined in the S4Vectors
> package. Would it not be more appropriate to use the name "SparkDataFrame"
> instead?
[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices
[ https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249084#comment-15249084 ] Apache Spark commented on SPARK-14739: -- User 'arashpa' has created a pull request for this issue: https://github.com/apache/spark/pull/12516 > Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with > no indices > --- > > Key: SPARK-14739 > URL: https://issues.apache.org/jira/browse/SPARK-14739 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz > > DenseVector: > {code} > Vectors.parse(str(Vectors.dense([]))) > ## ValueErrorTraceback (most recent call last) > ## .. > ## ValueError: Unable to parse values from > {code} > SparseVector: > {code} > Vectors.parse(str(Vectors.sparse(5, [], []))) > ## ValueErrorTraceback (most recent call last) > ## ... > ## ValueError: Unable to parse indices from . > {code}
[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices
[ https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249080#comment-15249080 ] Vishnu Prasad commented on SPARK-14739: --- I've merged your PR with your test fixes. Thank you. > Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with > no indices > --- > > Key: SPARK-14739 > URL: https://issues.apache.org/jira/browse/SPARK-14739 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz > > DenseVector: > {code} > Vectors.parse(str(Vectors.dense([]))) > ## ValueErrorTraceback (most recent call last) > ## .. > ## ValueError: Unable to parse values from > {code} > SparseVector: > {code} > Vectors.parse(str(Vectors.sparse(5, [], []))) > ## ValueErrorTraceback (most recent call last) > ## ... > ## ValueError: Unable to parse indices from . > {code}
[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices
[ https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249078#comment-15249078 ] Arash Parsa commented on SPARK-14739: - Sorry, I wasn't able to pull from your branch. I submitted a new PR with the proper updates. Please let me know how it looks. > Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with > no indices > --- > > Key: SPARK-14739 > URL: https://issues.apache.org/jira/browse/SPARK-14739 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz > > DenseVector: > {code} > Vectors.parse(str(Vectors.dense([]))) > ## ValueErrorTraceback (most recent call last) > ## .. > ## ValueError: Unable to parse values from > {code} > SparseVector: > {code} > Vectors.parse(str(Vectors.sparse(5, [], []))) > ## ValueErrorTraceback (most recent call last) > ## ... > ## ValueError: Unable to parse indices from . > {code}
[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices
[ https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249077#comment-15249077 ] Apache Spark commented on SPARK-14739: -- User 'arashpa' has created a pull request for this issue: https://github.com/apache/spark/pull/12515 > Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with > no indices > --- > > Key: SPARK-14739 > URL: https://issues.apache.org/jira/browse/SPARK-14739 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz > > DenseVector: > {code} > Vectors.parse(str(Vectors.dense([]))) > ## ValueErrorTraceback (most recent call last) > ## .. > ## ValueError: Unable to parse values from > {code} > SparseVector: > {code} > Vectors.parse(str(Vectors.sparse(5, [], []))) > ## ValueErrorTraceback (most recent call last) > ## ... > ## ValueError: Unable to parse indices from . > {code}
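The failing inputs stringify to "[]" and "(5,[],[])", so any parser that assumes a non-empty comma-separated list will raise on them. A minimal plain-Python sketch of the empty-list handling the fix needs (illustrative only, with a hypothetical helper name; this is not MLlib's actual parser):

```python
# Sketch of tolerant value-list parsing: '[0.1,0.2]' -> [0.1, 0.2], and the
# empty form '[]' -> [] instead of raising "Unable to parse values from".
# The same guard applies to the indices/values lists of a SparseVector's
# string form '(5,[],[])'.
def parse_value_list(s):
    """Parse a bracketed comma-separated list of floats, allowing '[]'."""
    inner = s.strip()[1:-1].strip()   # drop the surrounding brackets
    if not inner:                     # empty list: '[]' or '[ ]'
        return []
    return [float(tok) for tok in inner.split(",")]

print(parse_value_list("[]"))           # []
print(parse_value_list("[0.0,0.4]"))    # [0.0, 0.4]
```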
[jira] [Assigned] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS Query27
[ https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14521: Assignee: Apache Spark
> StackOverflowError in Kryo when executing TPC-DS Query27
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Rajesh Balamohan
> Assignee: Apache Spark
> Priority: Blocker
>
> Build details: Spark build from master branch (Apr-10)
> Dataset: TPC-DS at 200 GB scale in Parquet format stored in Hive
> Client: $SPARK_HOME/bin/beeline
> Query: TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyway)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}
[jira] [Assigned] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS Query27
[ https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14521: Assignee: (was: Apache Spark)
> StackOverflowError in Kryo when executing TPC-DS Query27
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Rajesh Balamohan
> Priority: Blocker
>
> Build details: Spark build from master branch (Apr-10)
> Dataset: TPC-DS at 200 GB scale in Parquet format stored in Hive
> Client: $SPARK_HOME/bin/beeline
> Query: TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyway)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}
[jira] [Commented] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS Query27
[ https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249067#comment-15249067 ] Apache Spark commented on SPARK-14521: -- User 'rajeshbalamohan' has created a pull request for this issue: https://github.com/apache/spark/pull/12514
> StackOverflowError in Kryo when executing TPC-DS Query27
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Rajesh Balamohan
> Priority: Blocker
>
> Build details: Spark build from master branch (Apr-10)
> Dataset: TPC-DS at 200 GB scale in Parquet format stored in Hive
> Client: $SPARK_HOME/bin/beeline
> Query: TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyway)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}
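The StackOverflowError comes from Kryo recursing once per level of a deeply nested object graph during broadcast. A mitigation sometimes tried for this failure mode (an assumption here, not a confirmed fix for this ticket; the linked PR may take a different approach) is to enlarge the JVM thread stack so the recursion has more headroom, e.g. in spark-defaults.conf:

```properties
# Hypothetical mitigation sketch: give driver and executor threads a larger
# stack so deep Kryo serialization recursion has room. 16m is an arbitrary
# example value, not a recommendation from this ticket.
spark.driver.extraJavaOptions    -Xss16m
spark.executor.extraJavaOptions  -Xss16m
```

If the object graph is genuinely unbounded, a larger -Xss only postpones the overflow; the durable fix is to keep the serialized structure from nesting that deeply in the first place.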
[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices
[ https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249061#comment-15249061 ] Maciej Szymkiewicz commented on SPARK-14739: I extracted the relevant test fixes and made a PR against your branch. > Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with > no indices > --- > > Key: SPARK-14739 > URL: https://issues.apache.org/jira/browse/SPARK-14739 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz > > DenseVector: > {code} > Vectors.parse(str(Vectors.dense([]))) > ## ValueErrorTraceback (most recent call last) > ## .. > ## ValueError: Unable to parse values from > {code} > SparseVector: > {code} > Vectors.parse(str(Vectors.sparse(5, [], []))) > ## ValueErrorTraceback (most recent call last) > ## ... > ## ValueError: Unable to parse indices from . > {code}
[jira] [Commented] (SPARK-14571) Log instrumentation in ALS
[ https://issues.apache.org/jira/browse/SPARK-14571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249058#comment-15249058 ] Timothy Hunter commented on SPARK-14571: Yes, please feel free to take this task. Thanks! > Log instrumentation in ALS > -- > > Key: SPARK-14571 > URL: https://issues.apache.org/jira/browse/SPARK-14571 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Timothy Hunter >
[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters
[ https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Shearer updated SPARK-14740: - Description:
If you tune hyperparameters using a CrossValidator object in PySpark, you may not be able to extract the parameter values of the best model.
{noformat}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
{noformat}
I can get the regression coefficient out, but I can't get the regularization parameter:
{noformat}
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])
In [4]: cvModel.bestModel.explainParams()
Out[4]: ''
In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}
In [15]: cvModel.params
Out[15]: []
In [36]: cvModel.bestModel.params
Out[36]: []
{noformat}
For the original issue raised on StackOverflow please see http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark

was:
If you tune hyperparameters using a CrossValidator object in PySpark, you may not be able to extract the parameter values of the best model.
{noformat}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
{noformat}
I can get the regression coefficient out, but I can't get the regularization parameter:
{noformat}
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])
In [4]: cvModel.bestModel.explainParams()
Out[4]: ''
In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}
In [15]: cvModel.params
Out[15]: []
In [36]: cvModel.bestModel.params
Out[36]: []
{noformat}
For a simple example please see http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark

> CrossValidatorModel.bestModel does not include hyper-parameters
> ---
>
> Key: SPARK-14740
> URL: https://issues.apache.org/jira/browse/SPARK-14740
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.1
> Reporter: Paul Shearer
>
> If you tune hyperparameters using a CrossValidator object in PySpark, you may
> not be able to extract the parameter values of the best model.
> {noformat}
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
> dataset = sqlContext.createDataFrame(
>     [(Vectors.dense([0.0]), 0.0),
>      (Vectors.dense([0.4]), 1.0),
>      (Vectors.dense([0.5]), 0.0),
>      (Vectors.dense([0.6]), 1.0),
>      (Vectors.dense([1.0]), 1.0)] * 10,
>     ["features", "label"])
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
> cvModel = cv.fit(dataset)
> {noformat}
> I can get the regression coefficient out, but I can't get the regularization parameter:
> {noformat}
> In [3]: cvModel.bestModel.coefficients
> Out[3]: DenseVector([3.1573])
> In [4]: cvModel.bestModel.explainParams()
> Out[4]: ''
> In [5]: cvModel.bestModel.extractParamMap()
> Out[5]: {}
> In [15]: cvModel.params
> Out[15]: []
> In [36]: cvModel.bestModel.params
> Out[36]: []
> {noformat}
> For the original issue raised on StackOverflow please see
> http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark
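Until bestModel exposes its tuned parameters, a commonly suggested workaround is to pair the parameter grid with cvModel.avgMetrics, which CrossValidator reports in the same order as the grid entries. A plain-Python sketch of that pairing (no Spark required; `grid` and `avg_metrics` below are stand-ins for `cvModel.getEstimatorParamMaps()` and `cvModel.avgMetrics`, the metric values are invented for illustration, and the metric is assumed to be one the evaluator maximizes, such as areaUnderROC):

```python
# Recover the winning hyper-parameters by zipping the grid with the
# cross-validation metrics, which are in the same order as the grid.
grid = [{"regParam": 0.1}, {"regParam": 0.01},
        {"regParam": 0.001}, {"regParam": 0.0001}]
avg_metrics = [0.82, 0.87, 0.91, 0.88]   # hypothetical evaluator scores

best_params, best_metric = max(zip(grid, avg_metrics), key=lambda pair: pair[1])
print(best_params)   # {'regParam': 0.001}
print(best_metric)   # 0.91
```

For a metric that should be minimized (e.g. RMSE), use min() with the same key.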
[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters
[ https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Shearer updated SPARK-14740: - Description:
If you tune hyperparameters using a CrossValidator object in PySpark, you may not be able to extract the parameter values of the best model.
{noformat}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
{noformat}
I can get the regression coefficient out, but I can't get the regularization parameter:
{noformat}
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])
In [4]: cvModel.bestModel.explainParams()
Out[4]: ''
In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}
In [15]: cvModel.params
Out[15]: []
In [36]: cvModel.bestModel.params
Out[36]: []
{noformat}
For a simple example please see http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark

was:
If you tune hyperparameters using a CrossValidator object in PySpark, you may not be able to extract the parameter values of the best model.
{{
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
}}
I can get the regression coefficient out, but I can't get the regularization parameter:
{{
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])
In [4]: cvModel.bestModel.explainParams()
Out[4]: ''
In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}
In [15]: cvModel.params
Out[15]: []
In [36]: cvModel.bestModel.params
Out[36]: []
}}
For a simple example please see http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark

> CrossValidatorModel.bestModel does not include hyper-parameters
> ---
>
> Key: SPARK-14740
> URL: https://issues.apache.org/jira/browse/SPARK-14740
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.1
> Reporter: Paul Shearer
>
> If you tune hyperparameters using a CrossValidator object in PySpark, you may
> not be able to extract the parameter values of the best model.
> {noformat}
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
> dataset = sqlContext.createDataFrame(
>     [(Vectors.dense([0.0]), 0.0),
>      (Vectors.dense([0.4]), 1.0),
>      (Vectors.dense([0.5]), 0.0),
>      (Vectors.dense([0.6]), 1.0),
>      (Vectors.dense([1.0]), 1.0)] * 10,
>     ["features", "label"])
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
> cvModel = cv.fit(dataset)
> {noformat}
> I can get the regression coefficient out, but I can't get the regularization parameter:
> {noformat}
> In [3]: cvModel.bestModel.coefficients
> Out[3]: DenseVector([3.1573])
> In [4]: cvModel.bestModel.explainParams()
> Out[4]: ''
> In [5]: cvModel.bestModel.extractParamMap()
> Out[5]: {}
> In [15]: cvModel.params
> Out[15]: []
> In [36]: cvModel.bestModel.params
> Out[36]: []
> {noformat}
> For a simple example please see
> http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark
[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters
[ https://issues.apache.org/jira/browse/SPARK-14740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Shearer updated SPARK-14740: - Description:
If you tune hyperparameters using a CrossValidator object in PySpark, you may not be able to extract the parameter values of the best model.
{noformat}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
{noformat}
I can get the regression coefficient out, but I can't get the regularization parameter:
{noformat}
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])
In [4]: cvModel.bestModel.explainParams()
Out[4]: ''
In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}
In [15]: cvModel.params
Out[15]: []
In [36]: cvModel.bestModel.params
Out[36]: []
{noformat}
For the original issue on StackOverflow please see http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark

was:
If you tune hyperparameters using a CrossValidator object in PySpark, you may not be able to extract the parameter values of the best model.
```
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
```
I can get the regression coefficient out, but I can't get the regularization parameter:
```
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])
In [4]: cvModel.bestModel.explainParams()
Out[4]: ''
In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}
In [15]: cvModel.params
Out[15]: []
In [36]: cvModel.bestModel.params
Out[36]: []
```
For a simple example please see http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark

> CrossValidatorModel.bestModel does not include hyper-parameters
> ---
>
> Key: SPARK-14740
> URL: https://issues.apache.org/jira/browse/SPARK-14740
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.1
> Reporter: Paul Shearer
>
> If you tune hyperparameters using a CrossValidator object in PySpark, you may
> not be able to extract the parameter values of the best model.
> {noformat}
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
> dataset = sqlContext.createDataFrame(
>     [(Vectors.dense([0.0]), 0.0),
>      (Vectors.dense([0.4]), 1.0),
>      (Vectors.dense([0.5]), 0.0),
>      (Vectors.dense([0.6]), 1.0),
>      (Vectors.dense([1.0]), 1.0)] * 10,
>     ["features", "label"])
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
> cvModel = cv.fit(dataset)
> {noformat}
> I can get the regression coefficient out, but I can't get the regularization parameter:
> {noformat}
> In [3]: cvModel.bestModel.coefficients
> Out[3]: DenseVector([3.1573])
> In [4]: cvModel.bestModel.explainParams()
> Out[4]: ''
> In [5]: cvModel.bestModel.extractParamMap()
> Out[5]: {}
> In [15]: cvModel.params
> Out[15]: []
> In [36]: cvModel.bestModel.params
> Out[36]: []
> {noformat}
> For the original issue on StackOverflow please see
> http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark
[jira] [Updated] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters
[jira] [Created] (SPARK-14740) CrossValidatorModel.bestModel does not include hyper-parameters
Paul Shearer created SPARK-14740:

Summary: CrossValidatorModel.bestModel does not include hyper-parameters
Key: SPARK-14740
URL: https://issues.apache.org/jira/browse/SPARK-14740
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.6.1
Reporter: Paul Shearer

If you tune hyperparameters using a CrossValidator object in PySpark, you may not be able to extract the parameter values of the best model.

{noformat}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
{noformat}

I can get the regression coefficient out, but I can't get the regularization parameter

{noformat}
In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])
In [4]: cvModel.bestModel.explainParams()
Out[4]: ''
In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}
In [15]: cvModel.params
Out[15]: []
In [36]: cvModel.bestModel.params
Out[36]: []
{noformat}

For a simple example please see http://stackoverflow.com/questions/36697304/how-to-extract-model-hyper-parameters-from-spark-ml-in-pyspark

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices
[ https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249053#comment-15249053 ] Maciej Szymkiewicz edited comment on SPARK-14739 at 4/20/16 12:47 AM: -- Sure, but your latest PR still doesn't resolve the problem with dead tests. Instead of copying, you could actually pull the changes from my repo. was (Author: zero323): Sure, but your latest PR still doesn't resolve the problem with dead tests. > Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with > no indices > --- > > Key: SPARK-14739 > URL: https://issues.apache.org/jira/browse/SPARK-14739 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz > > DenseVector: > {code} > Vectors.parse(str(Vectors.dense([]))) > ## ValueErrorTraceback (most recent call last) > ## .. > ## ValueError: Unable to parse values from > {code} > SparseVector: > {code} > Vectors.parse(str(Vectors.sparse(5, [], []))) > ## ValueErrorTraceback (most recent call last) > ## ... > ## ValueError: Unable to parse indices from . > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices
[ https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249053#comment-15249053 ] Maciej Szymkiewicz commented on SPARK-14739: Sure, but your latest PR still doesn't resolve the problem with dead tests. > Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with > no indices > --- > > Key: SPARK-14739 > URL: https://issues.apache.org/jira/browse/SPARK-14739 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz > > DenseVector: > {code} > Vectors.parse(str(Vectors.dense([]))) > ## ValueErrorTraceback (most recent call last) > ## .. > ## ValueError: Unable to parse values from > {code} > SparseVector: > {code} > Vectors.parse(str(Vectors.sparse(5, [], []))) > ## ValueErrorTraceback (most recent call last) > ## ... > ## ValueError: Unable to parse indices from . > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
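For reference, the two string shapes involved are `'[]'` (empty dense vector) and `'(5,[],[])'` (sparse vector with no indices). A tolerant parser that accepts the empty forms can be sketched in plain Python; this illustrates the expected behavior and is not the MLlib implementation:

```python
import ast

def parse_vector(s):
    """Parse '[v1,v2,...]' into a list of floats, or '(size,[indices],[values])'
    into a (size, indices, values) tuple, accepting the empty forms."""
    s = s.strip()
    if s.startswith("["):
        body = s[1:-1].strip()
        # An empty body means a dense vector of size 0, not a parse error.
        return [float(x) for x in body.split(",")] if body else []
    if s.startswith("("):
        # literal_eval handles '(5,[],[])' as well as populated tuples.
        size, indices, values = ast.literal_eval(s)
        return (size, list(indices), [float(v) for v in values])
    raise ValueError("Cannot parse vector string: %r" % (s,))

print(parse_vector("[]"))          # -> []
print(parse_vector("(5,[],[])"))   # -> (5, [], [])
```

The fix in MLlib would amount to the same guard: treat an empty values/indices section as a zero-length collection instead of raising.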
[jira] [Resolved] (SPARK-13929) Use Scala reflection for UDFs
[ https://issues.apache.org/jira/browse/SPARK-13929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13929. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12149 [https://github.com/apache/spark/pull/12149] > Use Scala reflection for UDFs > - > > Key: SPARK-13929 > URL: https://issues.apache.org/jira/browse/SPARK-13929 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Jakob Odersky >Priority: Minor > Fix For: 2.0.0 > > > {{ScalaReflection}} uses native Java reflection for User Defined Types which > would fail if such types are not plain Scala classes that map 1:1 to Java. > Consider the following extract (from here > https://github.com/apache/spark/blob/92024797a4fad594b5314f3f3be5c6be2434de8a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L376 > ): > {code} > case t if Utils.classIsLoadable(className) && > Utils.classForName(className).isAnnotationPresent(classOf[SQLUserDefinedType]) > => > val udt = > Utils.classForName(className).getAnnotation(classOf[SQLUserDefinedType]).udt().newInstance() > //... > {code} > If {{t}}'s runtime class is actually synthetic (something that doesn't exist > in Java and hence uses a dollar sign internally), such as nested classes or > package objects, the above code will fail. > Currently there are no known use-cases of synthetic user-defined types (hence > the minor priority), however it would be best practice to remove plain Java > reflection and rely on Scala reflection instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices
[ https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249036#comment-15249036 ] Apache Spark commented on SPARK-14739: -- User 'vishnu667' has created a pull request for this issue: https://github.com/apache/spark/pull/12513 > Vectors.parse doesn't handle dense vectors of size 0 and sparse vectros with > no indices > --- > > Key: SPARK-14739 > URL: https://issues.apache.org/jira/browse/SPARK-14739 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz > > DenseVector: > {code} > Vectors.parse(str(Vectors.dense([]))) > ## ValueErrorTraceback (most recent call last) > ## .. > ## ValueError: Unable to parse values from > {code} > SparseVector: > {code} > Vectors.parse(str(Vectors.sparse(5, [], []))) > ## ValueErrorTraceback (most recent call last) > ## ... > ## ValueError: Unable to parse indices from . > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14735) PySpark HashingTF hashAlgorithm param + docs
[ https://issues.apache.org/jira/browse/SPARK-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249034#comment-15249034 ] zhengruifeng commented on SPARK-14735: -- I can work on this after SPARK-10574 is resolved. > PySpark HashingTF hashAlgorithm param + docs > > > Key: SPARK-14735 > URL: https://issues.apache.org/jira/browse/SPARK-14735 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib, PySpark >Reporter: Joseph K. Bradley > > Add hashAlgorithm param to HashingTF in PySpark, and update docs to indicate > default algorithm is MurmurHash3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14407) Hide HadoopFsRelation related data source API to execution package
[ https://issues.apache.org/jira/browse/SPARK-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-14407. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12361 [https://github.com/apache/spark/pull/12361] > Hide HadoopFsRelation related data source API to execution package > -- > > Key: SPARK-14407 > URL: https://issues.apache.org/jira/browse/SPARK-14407 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS Query27
[ https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249012#comment-15249012 ] Rajesh Balamohan edited comment on SPARK-14521 at 4/20/16 12:31 AM:
Update:
- By default, spark-thrift server disables "spark.kryo.referenceTracking" (if not specified in conf). https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala#L55
- When "spark.kryo.referenceTracking" is set to true explicitly in spark-defaults.conf, the query executes successfully. Alternatively, "spark.sql.autoBroadcastJoinThreshold" can be set to a very low value in order to prevent broadcasting (this is done just for verification).
- Recent changes to LongHashedRelation could have introduced loops, which would need "spark.kryo.referenceTracking=true" in spark-thrift server. I will create a PR for this.

was (Author: rajesh.balamohan):
Update:
- By default, spark-thrift server disables "spark.kryo.referenceTracking". https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala#L55
- When "spark.kryo.referenceTracking" is set to true explicitly in spark-defaults.conf, the query executes successfully. Alternatively, "spark.sql.autoBroadcastJoinThreshold" can be set to a very low value in order to prevent broadcasting (this is done just for verification).
- Recent changes to LongHashedRelation could have introduced loops, which would need "spark.kryo.referenceTracking=true" in spark-thrift server. I will create a PR for this.
> StackOverflowError in Kryo when executing TPC-DS Query27 > > > Key: SPARK-14521 > URL: https://issues.apache.org/jira/browse/SPARK-14521 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Rajesh Balamohan >Priority: Blocker > > Build details: Spark build from master branch (Apr-10) > DataSet:TPC-DS at 200 GB scale in Parq format stored in hive. > Client: $SPARK_HOME/bin/beeline > Query: TPC-DS Query27 > spark.sql.sources.fileScan=true (this is the default value anyways) > Exception: > {noformat} > Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError > at > com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108) > at > com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99) > at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at >
[jira] [Resolved] (SPARK-14717) Scala, Python APIs for Dataset.unpersist differ in default blocking value
[ https://issues.apache.org/jira/browse/SPARK-14717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-14717. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12507 [https://github.com/apache/spark/pull/12507] > Scala, Python APIs for Dataset.unpersist differ in default blocking value > - > > Key: SPARK-14717 > URL: https://issues.apache.org/jira/browse/SPARK-14717 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Joseph K. Bradley >Assignee: Felix Cheung >Priority: Minor > Fix For: 2.0.0 > > > In Scala/Java {{Dataset.unpersist()}} sets blocking = false by default, but > in Python, it is set to True by default. We should presumably make them > consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices
[ https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249028#comment-15249028 ] Arash Parsa commented on SPARK-14739: - [~zero323] sure I can adjust my PR (move the tests), but since I found the bug do you think I should be getting the fix in? > Vectors.parse doesn't handle dense vectors of size 0 and sparse vectros with > no indices > --- > > Key: SPARK-14739 > URL: https://issues.apache.org/jira/browse/SPARK-14739 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz > > DenseVector: > {code} > Vectors.parse(str(Vectors.dense([]))) > ## ValueErrorTraceback (most recent call last) > ## .. > ## ValueError: Unable to parse values from > {code} > SparseVector: > {code} > Vectors.parse(str(Vectors.sparse(5, [], []))) > ## ValueErrorTraceback (most recent call last) > ## ... > ## ValueError: Unable to parse indices from . > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices
[ https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249020#comment-15249020 ] Apache Spark commented on SPARK-14739: -- User 'vishnu667' has created a pull request for this issue: https://github.com/apache/spark/pull/12512 > Vectors.parse doesn't handle dense vectors of size 0 and sparse vectros with > no indices > --- > > Key: SPARK-14739 > URL: https://issues.apache.org/jira/browse/SPARK-14739 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz > > DenseVector: > {code} > Vectors.parse(str(Vectors.dense([]))) > ## ValueErrorTraceback (most recent call last) > ## .. > ## ValueError: Unable to parse values from > {code} > SparseVector: > {code} > Vectors.parse(str(Vectors.sparse(5, [], []))) > ## ValueErrorTraceback (most recent call last) > ## ... > ## ValueError: Unable to parse indices from . > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS Query27
[ https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249012#comment-15249012 ] Rajesh Balamohan commented on SPARK-14521: --
Update:
- By default, spark-thrift server disables "spark.kryo.referenceTracking". https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLEnv.scala#L55
- When "spark.kryo.referenceTracking" is set to true explicitly in spark-defaults.conf, the query executes successfully. Alternatively, "spark.sql.autoBroadcastJoinThreshold" can be set to a very low value in order to prevent broadcasting (this is done just for verification).
- Recent changes to LongHashedRelation could have introduced loops, which would need "spark.kryo.referenceTracking=true" in spark-thrift server. I will create a PR for this.
> StackOverflowError in Kryo when executing TPC-DS Query27 > > > Key: SPARK-14521 > URL: https://issues.apache.org/jira/browse/SPARK-14521 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Rajesh Balamohan >Priority: Blocker > > Build details: Spark build from master branch (Apr-10) > DataSet:TPC-DS at 200 GB scale in Parq format stored in hive. 
> Client: $SPARK_HOME/bin/beeline > Query: TPC-DS Query27 > spark.sql.sources.fileScan=true (this is the default value anyways) > Exception: > {noformat} > Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError > at > com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108) > at > com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99) > at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
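The workaround described in the comment above amounts to a single entry in spark-defaults.conf. This is a sketch of the setting being discussed, assuming (per the comment) that an explicit value overrides the Thrift server's hard-coded default:

```
# spark-defaults.conf -- re-enable Kryo reference tracking so that object
# graphs containing cycles (e.g. LongHashedRelation) serialize without
# unbounded recursion in Kryo's writeClassAndObject.
spark.kryo.referenceTracking    true
```

Reference tracking costs some serialization speed, which is presumably why the Thrift server disables it by default.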
[jira] [Updated] (SPARK-14639) Add `bround` function in Python/R.
[ https://issues.apache.org/jira/browse/SPARK-14639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-14639: -- Component/s: SparkR PySpark > Add `bround` function in Python/R. > -- > > Key: SPARK-14639 > URL: https://issues.apache.org/jira/browse/SPARK-14639 > Project: Spark > Issue Type: Improvement > Components: PySpark, SparkR >Reporter: Dongjoon Hyun > > This issue aims to expose Scala `bround` function in Python/R API. > `bround` function is implemented in SPARK-14614 by extending current `round` > function. > We used the following semantics from > [Hive|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/RoundUtils.java]. > {code} > public static double bround(double input, int scale) { > if (Double.isNaN(input) || Double.isInfinite(input)) { > return input; > } > return BigDecimal.valueOf(input).setScale(scale, > RoundingMode.HALF_EVEN).doubleValue(); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices
[ https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248994#comment-15248994 ] Maciej Szymkiewicz commented on SPARK-14739: This solves only a small part of the problem. Right now both sparse and dense vector parsing is broken, not to mention that the corresponding tests are dead code. > Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with > no indices > --- > > Key: SPARK-14739 > URL: https://issues.apache.org/jira/browse/SPARK-14739 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz > > DenseVector: > {code} > Vectors.parse(str(Vectors.dense([]))) > ## ValueError Traceback (most recent call last) > ## .. > ## ValueError: Unable to parse values from > {code} > SparseVector: > {code} > Vectors.parse(str(Vectors.sparse(5, [], []))) > ## ValueError Traceback (most recent call last) > ## ... > ## ValueError: Unable to parse indices from . > {code}
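For illustration, a self-contained parser that accepts both string forms from the report, including the empty cases. This is a hypothetical sketch, not the real pyspark.mllib Vectors.parse code:

```python
import ast

def parse_vector(s):
    # Dense vectors serialize as "[v0,v1,...]" and sparse vectors as
    # "(size,[indices],[values])"; ast.literal_eval handles both shapes,
    # including the empty "[]" and "(5,[],[])" cases from the bug report.
    s = s.strip()
    if s.startswith("["):
        return ("dense", [float(v) for v in ast.literal_eval(s)])
    if s.startswith("("):
        size, indices, values = ast.literal_eval(s)
        return ("sparse", int(size), [int(i) for i in indices],
                [float(v) for v in values])
    raise ValueError("Cannot parse a vector from %r" % s)
```

The point is that an empty values or indices list is a valid (if degenerate) vector and should round-trip, rather than raise as in the reported traceback.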
[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices
[ https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248995#comment-15248995 ] Apache Spark commented on SPARK-14739: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/12511 > Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with > no indices > --- > > Key: SPARK-14739 > URL: https://issues.apache.org/jira/browse/SPARK-14739 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz > > DenseVector: > {code} > Vectors.parse(str(Vectors.dense([]))) > ## ValueError Traceback (most recent call last) > ## .. > ## ValueError: Unable to parse values from > {code} > SparseVector: > {code} > Vectors.parse(str(Vectors.sparse(5, [], []))) > ## ValueError Traceback (most recent call last) > ## ... > ## ValueError: Unable to parse indices from . > {code}
[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices
[ https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248973#comment-15248973 ] Arash Parsa commented on SPARK-14739: - Thanks for posting the ticket on Jira; I created the PR here: https://github.com/apache/spark/pull/12510 > Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with > no indices > --- > > Key: SPARK-14739 > URL: https://issues.apache.org/jira/browse/SPARK-14739 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz > > DenseVector: > {code} > Vectors.parse(str(Vectors.dense([]))) > ## ValueError Traceback (most recent call last) > ## .. > ## ValueError: Unable to parse values from > {code} > SparseVector: > {code} > Vectors.parse(str(Vectors.sparse(5, [], []))) > ## ValueError Traceback (most recent call last) > ## ... > ## ValueError: Unable to parse indices from . > {code}
[jira] [Assigned] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices
[ https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14739: Assignee: (was: Apache Spark) > Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with > no indices > --- > > Key: SPARK-14739 > URL: https://issues.apache.org/jira/browse/SPARK-14739 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz > > DenseVector: > {code} > Vectors.parse(str(Vectors.dense([]))) > ## ValueError Traceback (most recent call last) > ## .. > ## ValueError: Unable to parse values from > {code} > SparseVector: > {code} > Vectors.parse(str(Vectors.sparse(5, [], []))) > ## ValueError Traceback (most recent call last) > ## ... > ## ValueError: Unable to parse indices from . > {code}
[jira] [Commented] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices
[ https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248972#comment-15248972 ] Apache Spark commented on SPARK-14739: -- User 'arashpa' has created a pull request for this issue: https://github.com/apache/spark/pull/12510 > Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with > no indices > --- > > Key: SPARK-14739 > URL: https://issues.apache.org/jira/browse/SPARK-14739 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz > > DenseVector: > {code} > Vectors.parse(str(Vectors.dense([]))) > ## ValueError Traceback (most recent call last) > ## .. > ## ValueError: Unable to parse values from > {code} > SparseVector: > {code} > Vectors.parse(str(Vectors.sparse(5, [], []))) > ## ValueError Traceback (most recent call last) > ## ... > ## ValueError: Unable to parse indices from . > {code}
[jira] [Assigned] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices
[ https://issues.apache.org/jira/browse/SPARK-14739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14739: Assignee: Apache Spark > Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with > no indices > --- > > Key: SPARK-14739 > URL: https://issues.apache.org/jira/browse/SPARK-14739 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark > > DenseVector: > {code} > Vectors.parse(str(Vectors.dense([]))) > ## ValueError Traceback (most recent call last) > ## .. > ## ValueError: Unable to parse values from > {code} > SparseVector: > {code} > Vectors.parse(str(Vectors.sparse(5, [], []))) > ## ValueError Traceback (most recent call last) > ## ... > ## ValueError: Unable to parse indices from . > {code}
[jira] [Updated] (SPARK-14719) WriteAheadLogBasedBlockHandler should ignore BlockManager put errors
[ https://issues.apache.org/jira/browse/SPARK-14719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-14719: --- Target Version/s: (was: 2.0.0) > WriteAheadLogBasedBlockHandler should ignore BlockManager put errors > > > Key: SPARK-14719 > URL: https://issues.apache.org/jira/browse/SPARK-14719 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Josh Rosen >Assignee: Josh Rosen > > {{WriteAheadLogBasedBlockHandler}} will currently throw exceptions if > BlockManager puts fail, even though those puts are only performed as a > performance optimization. Instead, it should log and ignore exceptions > originating from the block manager put. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
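The proposed log-and-ignore behavior can be sketched language-agnostically. The function and callback names below are illustrative, not Spark's actual Scala API:

```python
import logging

log = logging.getLogger("WriteAheadLogBasedBlockHandler")

def store_block(write_to_wal, put_in_block_manager, block):
    # The write-ahead log is the source of truth: failures here must propagate.
    record_handle = write_to_wal(block)
    # The BlockManager put only speeds up local reads, so per the proposal
    # its failures are logged and swallowed instead of failing the store.
    try:
        put_in_block_manager(block)
    except Exception as exc:
        log.warning("Ignoring BlockManager put failure: %s", exc)
    return record_handle
```

With this shape, a failed optimization-only put no longer poisons a store whose durable WAL write already succeeded.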
[jira] [Resolved] (SPARK-14719) WriteAheadLogBasedBlockHandler should ignore BlockManager put errors
[ https://issues.apache.org/jira/browse/SPARK-14719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-14719. Resolution: Won't Fix > WriteAheadLogBasedBlockHandler should ignore BlockManager put errors > > > Key: SPARK-14719 > URL: https://issues.apache.org/jira/browse/SPARK-14719 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Josh Rosen >Assignee: Josh Rosen > > {{WriteAheadLogBasedBlockHandler}} will currently throw exceptions if > BlockManager puts fail, even though those puts are only performed as a > performance optimization. Instead, it should log and ignore exceptions > originating from the block manager put. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14719) WriteAheadLogBasedBlockHandler should ignore BlockManager put errors
[ https://issues.apache.org/jira/browse/SPARK-14719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-14719: --- Fix Version/s: (was: 2.0.0) > WriteAheadLogBasedBlockHandler should ignore BlockManager put errors > > > Key: SPARK-14719 > URL: https://issues.apache.org/jira/browse/SPARK-14719 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Josh Rosen >Assignee: Josh Rosen > > {{WriteAheadLogBasedBlockHandler}} will currently throw exceptions if > BlockManager puts fail, even though those puts are only performed as a > performance optimization. Instead, it should log and ignore exceptions > originating from the block manager put. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-14719) WriteAheadLogBasedBlockHandler should ignore BlockManager put errors
[ https://issues.apache.org/jira/browse/SPARK-14719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reopened SPARK-14719: Reverted this patch and am marking as "won't fix" for now. See discussion on PR for explanation of why I am reluctant to modify this code now. > WriteAheadLogBasedBlockHandler should ignore BlockManager put errors > > > Key: SPARK-14719 > URL: https://issues.apache.org/jira/browse/SPARK-14719 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Josh Rosen >Assignee: Josh Rosen > > {{WriteAheadLogBasedBlockHandler}} will currently throw exceptions if > BlockManager puts fail, even though those puts are only performed as a > performance optimization. Instead, it should log and ignore exceptions > originating from the block manager put. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14739) Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices
Maciej Szymkiewicz created SPARK-14739: -- Summary: Vectors.parse doesn't handle dense vectors of size 0 and sparse vectors with no indices Key: SPARK-14739 URL: https://issues.apache.org/jira/browse/SPARK-14739 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.6.0, 2.0.0 Reporter: Maciej Szymkiewicz DenseVector: {code} Vectors.parse(str(Vectors.dense([]))) ## ValueError Traceback (most recent call last) ## .. ## ValueError: Unable to parse values from {code} SparseVector: {code} Vectors.parse(str(Vectors.sparse(5, [], []))) ## ValueError Traceback (most recent call last) ## ... ## ValueError: Unable to parse indices from . {code}
[jira] [Commented] (SPARK-14571) Log instrumentation in ALS
[ https://issues.apache.org/jira/browse/SPARK-14571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248859#comment-15248859 ] Miao Wang commented on SPARK-14571: --- Since nobody has taken this one, can I learn it and give it a try? Thanks! Miao > Log instrumentation in ALS > -- > > Key: SPARK-14571 > URL: https://issues.apache.org/jira/browse/SPARK-14571 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Timothy Hunter >
[jira] [Resolved] (SPARK-12224) R support for JDBC source
[ https://issues.apache.org/jira/browse/SPARK-12224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-12224. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10480 [https://github.com/apache/spark/pull/10480] > R support for JDBC source > - > > Key: SPARK-12224 > URL: https://issues.apache.org/jira/browse/SPARK-12224 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Felix Cheung >Assignee: Apache Spark >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12224) R support for JDBC source
[ https://issues.apache.org/jira/browse/SPARK-12224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-12224: -- Assignee: Felix Cheung (was: Apache Spark) > R support for JDBC source > - > > Key: SPARK-12224 > URL: https://issues.apache.org/jira/browse/SPARK-12224 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14733) Allow custom timing control in microbenchmarks
[ https://issues.apache.org/jira/browse/SPARK-14733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14733. - Resolution: Fixed Assignee: Eric Liang Fix Version/s: 2.0.0 > Allow custom timing control in microbenchmarks > -- > > Key: SPARK-14733 > URL: https://issues.apache.org/jira/browse/SPARK-14733 > Project: Spark > Issue Type: Improvement >Reporter: Eric Liang >Assignee: Eric Liang > Fix For: 2.0.0 > > > The current benchmark framework runs a code block for several iterations and > reports statistics. However, there is no way to exclude per-iteration setup > time from the overall results.
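One common way to provide this kind of timing control is a per-iteration start/stop timer, so that untimed setup can run inside the loop. This is a sketch of the general idea, not the actual Benchmark API added by the patch:

```python
import time

class IterTimer:
    """Accumulates only the time spent between start() and stop() calls."""
    def __init__(self):
        self.total_ns = 0
        self._t0 = None

    def start(self):
        self._t0 = time.perf_counter_ns()

    def stop(self):
        self.total_ns += time.perf_counter_ns() - self._t0
        self._t0 = None

def run_case(num_iters, setup, body):
    # setup() runs untimed on every iteration; only body() is measured.
    timer = IterTimer()
    for _ in range(num_iters):
        state = setup()
        timer.start()
        body(state)
        timer.stop()
    return timer.total_ns
```

The benchmark harness then reports statistics over `timer.total_ns` rather than over wall-clock time of the whole loop, so per-iteration setup no longer inflates the results.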
[jira] [Commented] (SPARK-14639) Add `bround` function in Python/R.
[ https://issues.apache.org/jira/browse/SPARK-14639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248848#comment-15248848 ] Apache Spark commented on SPARK-14639: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/12509 > Add `bround` function in Python/R. > -- > > Key: SPARK-14639 > URL: https://issues.apache.org/jira/browse/SPARK-14639 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun > > This issue aims to expose the Scala `bround` function in the Python/R API. > The `bround` function was implemented in SPARK-14614 by extending the current `round` > function. > We used the following semantics from > [Hive|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/RoundUtils.java]. > {code} > public static double bround(double input, int scale) { > if (Double.isNaN(input) || Double.isInfinite(input)) { > return input; > } > return BigDecimal.valueOf(input).setScale(scale, > RoundingMode.HALF_EVEN).doubleValue(); > } > {code}
[jira] [Assigned] (SPARK-14639) Add `bround` function in Python/R.
[ https://issues.apache.org/jira/browse/SPARK-14639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14639: Assignee: (was: Apache Spark) > Add `bround` function in Python/R. > -- > > Key: SPARK-14639 > URL: https://issues.apache.org/jira/browse/SPARK-14639 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun > > This issue aims to expose the Scala `bround` function in the Python/R API. > The `bround` function was implemented in SPARK-14614 by extending the current `round` > function. > We used the following semantics from > [Hive|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/RoundUtils.java]. > {code} > public static double bround(double input, int scale) { > if (Double.isNaN(input) || Double.isInfinite(input)) { > return input; > } > return BigDecimal.valueOf(input).setScale(scale, > RoundingMode.HALF_EVEN).doubleValue(); > } > {code}
[jira] [Assigned] (SPARK-14639) Add `bround` function in Python/R.
[ https://issues.apache.org/jira/browse/SPARK-14639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14639: Assignee: Apache Spark > Add `bround` function in Python/R. > -- > > Key: SPARK-14639 > URL: https://issues.apache.org/jira/browse/SPARK-14639 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > This issue aims to expose the Scala `bround` function in the Python/R API. > The `bround` function was implemented in SPARK-14614 by extending the current `round` > function. > We used the following semantics from > [Hive|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/RoundUtils.java]. > {code} > public static double bround(double input, int scale) { > if (Double.isNaN(input) || Double.isInfinite(input)) { > return input; > } > return BigDecimal.valueOf(input).setScale(scale, > RoundingMode.HALF_EVEN).doubleValue(); > } > {code}
[jira] [Commented] (SPARK-14478) Should StandardScaler use biased variance to scale?
[ https://issues.apache.org/jira/browse/SPARK-14478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248825#comment-15248825 ] Joseph K. Bradley commented on SPARK-14478: --- Adding a param seems reasonable, though probably pretty low priority. To make a judgement call...how about we leave it as is for now? I'll send a PR to document that it's using unbiased variance. If any user ever needs biased, then we can add the Param (but I've never heard anyone except myself complain). > Should StandardScaler use biased variance to scale? > --- > > Key: SPARK-14478 > URL: https://issues.apache.org/jira/browse/SPARK-14478 > Project: Spark > Issue Type: Question > Components: ML, MLlib >Reporter: Joseph K. Bradley > > Currently, MLlib's StandardScaler scales columns using the unbiased standard > deviation. This matches what R's scale package does. > However, it is a bit odd for 2 reasons: > * Optimization/ML algorithms which require scaled columns generally assume > unit variance (for mathematical convenience). That requires using biased > variance. > * scikit-learn, MLlib's GLMs, and R's glmnet package all use biased variance. > *Question*: Should we switch to biased?
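The biased/unbiased distinction is just the n versus n-1 divisor in the variance. Python's standard statistics module exposes both, which makes the trade-off in the issue concrete:

```python
import statistics

data = [1.0, 2.0, 3.0, 4.0]

# Unbiased (sample) standard deviation divides squared deviations by n-1;
# this is what StandardScaler and R's scale() use today.
unbiased_sd = statistics.stdev(data)

# Biased (population) standard deviation divides by n; this is what
# scikit-learn, MLlib's GLMs, and glmnet use.
biased_sd = statistics.pstdev(data)

# Scaling by the biased SD is what yields exactly unit (population) variance,
# which is the convenience optimization algorithms assume.
scaled = [x / biased_sd for x in data]
```

Dividing by the unbiased SD instead leaves the scaled column with population variance (n-1)/n, slightly below 1, which is the mathematical wrinkle the issue describes.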
[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test
[ https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248822#comment-15248822 ] Joseph K. Bradley commented on SPARK-8884: -- I'm not sure this will make 2.0, so I'm changing the target to 2.1. [~mengxr] please retarget if needed. > 1-sample Anderson-Darling Goodness-of-Fit test > -- > > Key: SPARK-8884 > URL: https://issues.apache.org/jira/browse/SPARK-8884 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Jose Cambronero > > We have implemented a 1-sample Anderson-Darling goodness-of-fit test to add > to the current hypothesis testing functionality. The current implementation > supports various distributions (normal, exponential, gumbel, logistic, and > weibull). However, users must provide distribution parameters for all except > normal/exponential (in which case they are estimated from the data). In > contrast to other tests, such as the Kolmogorov Smirnov test, we only support > specific distributions as the critical values depend on the distribution > being tested. > The distributed implementation of AD takes advantage of the fact that we can > calculate a portion of the statistic within each partition of a sorted data > set, independent of the global order of those observations. We can then carry > some additional information that allows us to adjust the final amounts once > we have collected 1 result per partition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
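For reference, the single-machine form of the statistic the proposal distributes is the textbook A-squared against a fully specified distribution. A hedged sketch for the normal case, using only the standard library rather than the proposed MLlib API:

```python
from math import log
from statistics import NormalDist

def anderson_darling_normal(sample, mean, stddev):
    # A^2 = -n - (1/n) * sum_{i=1..n} (2i-1) * [ln F(x_(i)) + ln(1 - F(x_(n+1-i)))]
    # over the sorted sample, where F is the hypothesized CDF. The distributed
    # version described above accumulates these per-rank terms within each
    # sorted partition and then merges and adjusts the partial sums.
    xs = sorted(sample)
    n = len(xs)
    dist = NormalDist(mean, stddev)
    acc = 0.0
    for i, x in enumerate(xs, start=1):
        acc += (2 * i - 1) * (log(dist.cdf(x)) + log(1.0 - dist.cdf(xs[n - i])))
    return -n - acc / n
```

The dependence of each term on the global rank i is why the distributed implementation must carry rank-offset information per partition, as the description notes.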
[jira] [Updated] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test
[ https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8884: - Target Version/s: 2.1.0 (was: 2.0.0) > 1-sample Anderson-Darling Goodness-of-Fit test > -- > > Key: SPARK-8884 > URL: https://issues.apache.org/jira/browse/SPARK-8884 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Jose Cambronero > > We have implemented a 1-sample Anderson-Darling goodness-of-fit test to add > to the current hypothesis testing functionality. The current implementation > supports various distributions (normal, exponential, gumbel, logistic, and > weibull). However, users must provide distribution parameters for all except > normal/exponential (in which case they are estimated from the data). In > contrast to other tests, such as the Kolmogorov Smirnov test, we only support > specific distributions as the critical values depend on the distribution > being tested. > The distributed implementation of AD takes advantage of the fact that we can > calculate a portion of the statistic within each partition of a sorted data > set, independent of the global order of those observations. We can then carry > some additional information that allows us to adjust the final amounts once > we have collected 1 result per partition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10388) Public dataset loader interface
[ https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10388: -- Target Version/s: 2.1.0 (was: 2.0.0) > Public dataset loader interface > --- > > Key: SPARK-10388 > URL: https://issues.apache.org/jira/browse/SPARK-10388 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng > Attachments: SPARK-10388PublicDataSetLoaderInterface.pdf > > > It is very useful to have a public dataset loader to fetch ML datasets from > popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, > requirements, and initial implementation. > {code} > val loader = new DatasetLoader(sqlContext) > val df = loader.get("libsvm", "rcv1_train.binary") > {code} > User should be able to list (or preview) datasets, e.g. > {code} > val datasets = loader.ls("libsvm") // returns a local DataFrame > datasets.show() // list all datasets under libsvm repo > {code} > It would be nice to allow 3rd-party packages to register new repos. Both the > API and implementation are pending discussion. Note that this requires http > and https support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4226) SparkSQL - Add support for subqueries in predicates
[ https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-4226. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12306 [https://github.com/apache/spark/pull/12306] > SparkSQL - Add support for subqueries in predicates > --- > > Key: SPARK-4226 > URL: https://issues.apache.org/jira/browse/SPARK-4226 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.0 > Environment: Spark 1.2 snapshot >Reporter: Terry Siu > Fix For: 2.0.0 > > > I have a test table defined in Hive as follows: > {code:sql} > CREATE TABLE sparkbug ( > id INT, > event STRING > ) STORED AS PARQUET; > {code} > and insert some sample data with ids 1, 2, 3. > In a Spark shell, I then create a HiveContext and then execute the following > HQL to test out subquery predicates: > {code} > val hc = HiveContext(hc) > hc.hql("select customerid from sparkbug where customerid in (select > customerid from sparkbug where customerid in (2,3))") > {code} > I get the following error: > {noformat} > java.lang.RuntimeException: Unsupported language features in query: select > customerid from sparkbug where customerid in (select customerid from sparkbug > where customerid in (2,3)) > TOK_QUERY > TOK_FROM > TOK_TABREF > TOK_TABNAME > sparkbug > TOK_INSERT > TOK_DESTINATION > TOK_DIR > TOK_TMP_FILE > TOK_SELECT > TOK_SELEXPR > TOK_TABLE_OR_COL > customerid > TOK_WHERE > TOK_SUBQUERY_EXPR > TOK_SUBQUERY_OP > in > TOK_QUERY > TOK_FROM > TOK_TABREF > TOK_TABNAME > sparkbug > TOK_INSERT > TOK_DESTINATION > TOK_DIR > TOK_TMP_FILE > TOK_SELECT > TOK_SELEXPR > TOK_TABLE_OR_COL > customerid > TOK_WHERE > TOK_FUNCTION > in > TOK_TABLE_OR_COL > customerid > 2 > 3 > TOK_TABLE_OR_COL > customerid > scala.NotImplementedError: No parse rules for ASTNode type: 817, text: > TOK_SUBQUERY_EXPR : > TOK_SUBQUERY_EXPR > TOK_SUBQUERY_OP > in > TOK_QUERY > TOK_FROM > TOK_TABREF > TOK_TABNAME > sparkbug > 
TOK_INSERT > TOK_DESTINATION > TOK_DIR > TOK_TMP_FILE > TOK_SELECT > TOK_SELEXPR > TOK_TABLE_OR_COL > customerid > TOK_WHERE > TOK_FUNCTION > in > TOK_TABLE_OR_COL > customerid > 2 > 3 > TOK_TABLE_OR_COL > customerid > " + > > org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098) > > at scala.sys.package$.error(package.scala:27) > at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252) > at > org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50) > at > org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49) > at > scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > {noformat} > [This > thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html] > also brings up lack of subquery support in SparkSQL. It would be nice to > have subquery predicate support in a near, future release (1.3, maybe?). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
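The IN-subquery predicate in the report is semantically a left semi join: keep each outer row whose key appears in the subquery's result. A plain-Python illustration of that semantics on the report's sample ids (1, 2, 3), independent of Spark:

```python
def in_subquery(rows, key, subquery_rows, subquery_key):
    # WHERE rows.key IN (SELECT subquery_key FROM subquery_rows),
    # evaluated as a left semi join against the collected key set.
    keys = {row[subquery_key] for row in subquery_rows}
    return [row for row in rows if row[key] in keys]

# Sample data mirroring the bug report's query (its "customerid" column).
sparkbug = [{"customerid": 1}, {"customerid": 2}, {"customerid": 3}]
inner = [r for r in sparkbug if r["customerid"] in (2, 3)]
result = in_subquery(sparkbug, "customerid", inner, "customerid")
```

This is the rewrite (IN-subquery to semi join) that SQL engines typically perform, and the resolving PR adds the corresponding support to Spark SQL's planner.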
[jira] [Commented] (SPARK-14738) Separate Docker Integration Tests from main spark build
[ https://issues.apache.org/jira/browse/SPARK-14738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248788#comment-15248788 ] Apache Spark commented on SPARK-14738: -- User 'lresende' has created a pull request for this issue: https://github.com/apache/spark/pull/12508 > Separate Docker Integration Tests from main spark build > --- > > Key: SPARK-14738 > URL: https://issues.apache.org/jira/browse/SPARK-14738 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Reporter: Luciano Resende > > Currently, docker integration tests are run as part of the main build, but > they require dev machines to have the full docker installation set up, which > in most cases is not available, and thus the tests will fail. > This would separate the tests from the main Spark build and make them > optional, so they could be invoked manually or as part of CI tests
[jira] [Assigned] (SPARK-14738) Separate Docker Integration Tests from main spark build
[ https://issues.apache.org/jira/browse/SPARK-14738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14738: Assignee: (was: Apache Spark) > Separate Docker Integration Tests from main spark build > --- > > Key: SPARK-14738 > URL: https://issues.apache.org/jira/browse/SPARK-14738 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Reporter: Luciano Resende > > Currently, docker integration tests are run as part of the main build, but > they require dev machines to have the full docker installation set up, which > in most cases is not available, and thus the tests will fail. > This would separate the tests from the main Spark build and make them > optional, so they could be invoked manually or as part of CI tests
[jira] [Assigned] (SPARK-14738) Separate Docker Integration Tests from main spark build
[ https://issues.apache.org/jira/browse/SPARK-14738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14738: Assignee: Apache Spark > Separate Docker Integration Tests from main spark build > --- > > Key: SPARK-14738 > URL: https://issues.apache.org/jira/browse/SPARK-14738 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Reporter: Luciano Resende >Assignee: Apache Spark > > Currently, docker integration tests are run as part of the main build, but > they require dev machines to have the full docker installation set up, which > in most cases is not available, and thus the tests will fail. > This would separate the tests from the main Spark build and make them > optional, so they could be invoked manually or as part of CI tests
[jira] [Created] (SPARK-14738) Separate Docker Integration Tests from main spark build
Luciano Resende created SPARK-14738: --- Summary: Separate Docker Integration Tests from main spark build Key: SPARK-14738 URL: https://issues.apache.org/jira/browse/SPARK-14738 Project: Spark Issue Type: Bug Components: Build, SQL Reporter: Luciano Resende Currently the Docker integration tests run as part of the main build, but this requires dev machines to have a full Docker installation set up, which in most cases is not available, so the tests fail. This change would separate the tests from the main Spark build and make them optional, so they could be invoked manually or as part of CI tests.
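One way the split could look, as a build-invocation sketch (the profile and module names below are assumptions for illustration; the JIRA does not fix them): the tests move into their own Maven module that is only activated by an opt-in profile, so a plain build skips them.

```shell
# Default build: Docker integration tests are skipped entirely.
./build/mvn -DskipTests package

# Opt-in run, e.g. on a CI box that actually has Docker installed.
# "docker-integration-tests" is a hypothetical profile/module name.
./build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 test
```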
[jira] [Updated] (SPARK-14717) Scala, Python APIs for Dataset.unpersist differ in default blocking value
[ https://issues.apache.org/jira/browse/SPARK-14717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-14717: --- Assignee: Felix Cheung > Scala, Python APIs for Dataset.unpersist differ in default blocking value > - > > Key: SPARK-14717 > URL: https://issues.apache.org/jira/browse/SPARK-14717 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Joseph K. Bradley >Assignee: Felix Cheung >Priority: Minor > > In Scala/Java {{Dataset.unpersist()}} sets blocking = false by default, but > in Python, it is set to True by default. We should presumably make them > consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
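The fix is essentially a default-argument change in the Python wrapper. A minimal pure-Python sketch of the proposed behavior (`MockDataset` is a stand-in for illustration, not the real PySpark class):

```python
class MockDataset:
    """Stand-in for PySpark's Dataset; records the blocking flag it receives."""

    def __init__(self):
        self.last_blocking = None

    def unpersist(self, blocking=False):
        # Proposed Python default: False, matching Scala/Java's
        # Dataset.unpersist(), instead of the current Python default of True.
        self.last_blocking = blocking
        return self

ds = MockDataset()
ds.unpersist()            # caller passes nothing ...
print(ds.last_blocking)   # ... and gets the Scala-consistent default: False
```

Callers that relied on the old blocking behavior would pass `blocking=True` explicitly, which keeps working unchanged.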
[jira] [Assigned] (SPARK-14717) Scala, Python APIs for Dataset.unpersist differ in default blocking value
[ https://issues.apache.org/jira/browse/SPARK-14717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14717: Assignee: Apache Spark > Scala, Python APIs for Dataset.unpersist differ in default blocking value > - > > Key: SPARK-14717 > URL: https://issues.apache.org/jira/browse/SPARK-14717 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > In Scala/Java {{Dataset.unpersist()}} sets blocking = false by default, but > in Python, it is set to True by default. We should presumably make them > consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14717) Scala, Python APIs for Dataset.unpersist differ in default blocking value
[ https://issues.apache.org/jira/browse/SPARK-14717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14717: Assignee: (was: Apache Spark) > Scala, Python APIs for Dataset.unpersist differ in default blocking value > - > > Key: SPARK-14717 > URL: https://issues.apache.org/jira/browse/SPARK-14717 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Joseph K. Bradley >Priority: Minor > > In Scala/Java {{Dataset.unpersist()}} sets blocking = false by default, but > in Python, it is set to True by default. We should presumably make them > consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14717) Scala, Python APIs for Dataset.unpersist differ in default blocking value
[ https://issues.apache.org/jira/browse/SPARK-14717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248721#comment-15248721 ] Apache Spark commented on SPARK-14717: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/12507 > Scala, Python APIs for Dataset.unpersist differ in default blocking value > - > > Key: SPARK-14717 > URL: https://issues.apache.org/jira/browse/SPARK-14717 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Joseph K. Bradley >Priority: Minor > > In Scala/Java {{Dataset.unpersist()}} sets blocking = false by default, but > in Python, it is set to True by default. We should presumably make them > consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14042) Add support for custom coalescers
[ https://issues.apache.org/jira/browse/SPARK-14042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14042. - Resolution: Fixed Assignee: Nezih Yigitbasi Fix Version/s: 2.0.0 > Add support for custom coalescers > - > > Key: SPARK-14042 > URL: https://issues.apache.org/jira/browse/SPARK-14042 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Nezih Yigitbasi >Assignee: Nezih Yigitbasi > Fix For: 2.0.0 > > > Per our discussion on the mailing list (please see > [here|http://mail-archives.apache.org/mod_mbox//spark-dev/201602.mbox/%3CCA+g63F7aVRBH=WyyK3nvBSLCMPtSdUuL_Ge9=ww4dnmnvy4...@mail.gmail.com%3E]) > it would be nice to specify a custom coalescing policy as the current > {{coalesce()}} method only allows the user to specify the number of > partitions and we cannot really control much. The need for this feature > popped up when I wanted to merge small files by coalescing them by size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
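The motivating use case above, merging many small partitions up to a target size, can be sketched independently of the RDD API. A greedy pure-Python illustration of the kind of policy a pluggable coalescer could implement (the function name and signature here are illustrative, not the API added by the PR):

```python
def coalesce_by_size(partition_sizes, target_bytes):
    """Greedily group consecutive partitions so each group's total size
    stays at or under target_bytes; an oversized partition gets its own group."""
    groups, current, current_size = [], [], 0
    for i, size in enumerate(partition_sizes):
        if current and current_size + size > target_bytes:
            groups.append(current)        # close the current group ...
            current, current_size = [], 0  # ... and start a new one
        current.append(i)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Three small partitions, one big one, then three tiny ones, packed
# toward a 128-byte target:
print(coalesce_by_size([40, 40, 40, 200, 10, 10, 10], 128))
# [[0, 1, 2], [3], [4, 5, 6]]
```

A custom coalescer plugged into `coalesce()` would apply exactly this kind of grouping to decide which parent partitions each output partition reads.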
[jira] [Resolved] (SPARK-14656) Benchmark.getPorcessorName() always return "Unknown processor" on Linux
[ https://issues.apache.org/jira/browse/SPARK-14656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-14656. --- Resolution: Fixed Assignee: Kazuaki Ishizaki Fix Version/s: 2.0.0 > Benchmark.getPorcessorName() always return "Unknown processor" on Linux > --- > > Key: SPARK-14656 > URL: https://issues.apache.org/jira/browse/SPARK-14656 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 > Environment: Linux >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Critical > Fix For: 2.0.0 > > > When we call org.apache.spark.util.Benchmark.getPorcessorName() on Linux, it > always return {{"Unknown processor"}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14736) Deadlock in registering applications while the Master is in the RECOVERING mode
[ https://issues.apache.org/jira/browse/SPARK-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14736: Assignee: (was: Apache Spark) > Deadlock in registering applications while the Master is in the RECOVERING > mode > --- > > Key: SPARK-14736 > URL: https://issues.apache.org/jira/browse/SPARK-14736 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.5.0, 1.6.0 > Environment: unix, Spark cluster with a custom > StandaloneRecoveryModeFactory and a custom PersistenceEngine >Reporter: niranda perera >Priority: Critical > > I have encountered the following issue in the standalone recovery mode. > Let's say there was an application A running in the cluster. Due to some > issue, the entire cluster, together with the application A goes down. > Then later on, cluster comes back online, and the master then goes into the > 'recovering' mode, because it sees some apps, workers and drivers have > already been in the cluster from Persistence Engine. While in the recovery > process, the application comes back online, but now it would have a different > ID, let's say B. > But then, as per the master, application registration logic, this application > B will NOT be added to the 'waitingApps' with the message ""Attempted to > re-register application at same address". [1] > private def registerApplication(app: ApplicationInfo): Unit = { > val appAddress = app.driver.address > if (addressToApp.contains(appAddress)) { > logInfo("Attempted to re-register application at same address: " + > appAddress) > return > } > The problem here is, master is trying to recover application A, which is not > in there anymore. Therefore after the recovery process, app A will be > dropped. However app A's successor, app B was also omitted from the > 'waitingApps' list because it had the same address as App A previously. > This creates a deadlock in the cluster, app A nor app B is available in the > cluster. 
> When the master is in the RECOVERING mode, shouldn't it add all the > registering apps to a list first, and then after the recovery is completed > (once the unsuccessful recoveries are removed), deploy the apps which are new? > This would sort this deadlock IMO? > [1] > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14736) Deadlock in registering applications while the Master is in the RECOVERING mode
[ https://issues.apache.org/jira/browse/SPARK-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14736: Assignee: Apache Spark > Deadlock in registering applications while the Master is in the RECOVERING > mode > --- > > Key: SPARK-14736 > URL: https://issues.apache.org/jira/browse/SPARK-14736 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.5.0, 1.6.0 > Environment: unix, Spark cluster with a custom > StandaloneRecoveryModeFactory and a custom PersistenceEngine >Reporter: niranda perera >Assignee: Apache Spark >Priority: Critical > > I have encountered the following issue in the standalone recovery mode. > Let's say there was an application A running in the cluster. Due to some > issue, the entire cluster, together with the application A goes down. > Then later on, cluster comes back online, and the master then goes into the > 'recovering' mode, because it sees some apps, workers and drivers have > already been in the cluster from Persistence Engine. While in the recovery > process, the application comes back online, but now it would have a different > ID, let's say B. > But then, as per the master, application registration logic, this application > B will NOT be added to the 'waitingApps' with the message ""Attempted to > re-register application at same address". [1] > private def registerApplication(app: ApplicationInfo): Unit = { > val appAddress = app.driver.address > if (addressToApp.contains(appAddress)) { > logInfo("Attempted to re-register application at same address: " + > appAddress) > return > } > The problem here is, master is trying to recover application A, which is not > in there anymore. Therefore after the recovery process, app A will be > dropped. However app A's successor, app B was also omitted from the > 'waitingApps' list because it had the same address as App A previously. 
> This creates a deadlock in the cluster, app A nor app B is available in the > cluster. > When the master is in the RECOVERING mode, shouldn't it add all the > registering apps to a list first, and then after the recovery is completed > (once the unsuccessful recoveries are removed), deploy the apps which are new? > This would sort this deadlock IMO? > [1] > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14736) Deadlock in registering applications while the Master is in the RECOVERING mode
[ https://issues.apache.org/jira/browse/SPARK-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248693#comment-15248693 ] Apache Spark commented on SPARK-14736: -- User 'nirandaperera' has created a pull request for this issue: https://github.com/apache/spark/pull/12506 > Deadlock in registering applications while the Master is in the RECOVERING > mode > --- > > Key: SPARK-14736 > URL: https://issues.apache.org/jira/browse/SPARK-14736 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.5.0, 1.6.0 > Environment: unix, Spark cluster with a custom > StandaloneRecoveryModeFactory and a custom PersistenceEngine >Reporter: niranda perera >Priority: Critical > > I have encountered the following issue in the standalone recovery mode. > Let's say there was an application A running in the cluster. Due to some > issue, the entire cluster, together with the application A goes down. > Then later on, cluster comes back online, and the master then goes into the > 'recovering' mode, because it sees some apps, workers and drivers have > already been in the cluster from Persistence Engine. While in the recovery > process, the application comes back online, but now it would have a different > ID, let's say B. > But then, as per the master, application registration logic, this application > B will NOT be added to the 'waitingApps' with the message ""Attempted to > re-register application at same address". [1] > private def registerApplication(app: ApplicationInfo): Unit = { > val appAddress = app.driver.address > if (addressToApp.contains(appAddress)) { > logInfo("Attempted to re-register application at same address: " + > appAddress) > return > } > The problem here is, master is trying to recover application A, which is not > in there anymore. Therefore after the recovery process, app A will be > dropped. 
However app A's successor, app B was also omitted from the > 'waitingApps' list because it had the same address as App A previously. > This creates a deadlock in the cluster, app A nor app B is available in the > cluster. > When the master is in the RECOVERING mode, shouldn't it add all the > registering apps to a list first, and then after the recovery is completed > (once the unsuccessful recoveries are removed), deploy the apps which are new? > This would sort this deadlock IMO? > [1] > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14737) Kafka Brokers are down - spark stream should retry
Faisal created SPARK-14737: -- Summary: Kafka Brokers are down - spark stream should retry Key: SPARK-14737 URL: https://issues.apache.org/jira/browse/SPARK-14737 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Environment: Suse Linux, Cloudera Enterprise 5.4.8 (#7 built by jenkins on 20151023-1205 git: d7dbdf29ac1d57ae9fb19958502d50dcf4e4fffd), kafka_2.10-0.8.2.2 Reporter: Faisal I have a spark streaming application that uses direct streaming - listening to a KAFKA topic. {code} HashMap<String, String> kafkaParams = new HashMap<String, String>(); kafkaParams.put("metadata.broker.list", "broker1,broker2,broker3"); kafkaParams.put("auto.offset.reset", "largest"); HashSet<String> topicsSet = new HashSet<String>(); topicsSet.add("Topic1"); JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream( jssc, String.class, String.class, StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet ); {code} I notice that when I stop/shut down the kafka brokers, my spark application also shuts down. Here is the spark execution script {code} spark-submit \ --master yarn-cluster \ --files /home/siddiquf/spark/log4j-spark.xml \ --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-spark.xml" \ --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-spark.xml" \ --class com.example.MyDataStreamProcessor \ myapp.jar {code} The spark job submits successfully and I can track the application driver and worker/executor nodes. Everything works fine, but my only concern is: if the kafka brokers are offline or restarted, shouldn't my application, controlled by yarn, stay up? But it shuts down. If this is expected behavior, then how should such a situation be handled with the least maintenance? Keep in mind that the Kafka cluster is not in the hadoop cluster and is managed by a different team, which is why our application needs to be resilient enough. Thanks
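Pending a fix in Spark itself, one application-level workaround is to wrap stream creation in a retry loop with backoff instead of failing fast. A generic pure-Python sketch (the `connect` callable stands in for whatever creates the Kafka-backed stream; none of these names come from the Spark API):

```python
import time

def with_retries(connect, max_attempts=5, base_delay=1.0):
    """Call `connect` until it succeeds, with exponential backoff between
    attempts; re-raise only after the final attempt fails."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Simulated brokers that come back on the third attempt:
attempts = {"n": 0}
def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("brokers down")
    return "stream"

print(with_retries(flaky_connect, base_delay=0.01))  # stream
```

The same shape applies to a driver that restarts its streaming context when brokers disappear, rather than letting the whole YARN application exit.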
[jira] [Comment Edited] (SPARK-10574) HashingTF should use MurmurHash3
[ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248627#comment-15248627 ] Simeon Simeonov edited comment on SPARK-10574 at 4/19/16 9:05 PM: -- [~josephkb] I agree that it would be an improvement. The issue I see with the current patch is that it would be an incompatible API change in the future (specifying hashing functions as objects and not by name). If we make just this one change everything else can be handled with no API changes, e.g., seeds are just constructor parameters or closure variables available to the hashing function and collision detection is just decoration. That's my practical argument related to MLlib. Beyond that, there are multiple arguments related to the usability, testability and maintainability of the Spark codebase, which has high code change velocity from a large number of committers, which contributes to a high issue rate. The simplest way to battle this is one design decision at a time. The PR hard-codes what is essentially a strategy pattern in the implementation of an object. It conflates responsibilities. It introduces branching this makes testing and documentation more complicated. If hashing functions are externalized, they could be trivially tested. If {{HashingTF}} took a {{Function1[Any, Int]}} as input it could also be tested much more simply with any function. The behavior and the APIs become simpler to document because they do one thing. Etc. Perhaps I'm only seeing the benefits of externalizing the hashing strategy and missing the complexity in what I propose? We have ample examples of Spark APIs using functions as inputs so there are standard ways to handle this across languages. We don't need a custom trait if we stick to {{Any}} as the hashing function input. What else could be a problem? was (Author: simeons): [~josephkb] I agree that it would be an improvement. 
The issue I see with the current patch is that it would be an incompatible API change in the future (specifying hashing functions as objects and not by name). If we make just this one change everything else can be handled with no API changes, e.g., seeds are just constructor parameters or closure variables available to the hashing function and collision detection is just decoration. That's my practical argument related to MLlib. Beyond that, there are multiple arguments related to the usability, testability and maintainability of the Spark codebase, which has high code change velocity from a large number of committers, which contributes to a high issue rate. The simplest way to battle this is one design decision at a time. The PR hard-codes what is essentially a strategy pattern in the implementation of an object. It conflates responsibilities. It introduces branching this makes testing and documentation more complicated. If hashing functions are externalized, they could be trivially tested. If {{HashingTF}} took a {{Function1[Any, Int]}} as input it could also be tested much more simply with any function. The documentation and the APIs become simpler to document because they do one thing. Etc. Perhaps I'm only seeing the benefits of externalizing the hashing strategy and missing the complexity in what I propose? We have ample examples of Spark APIs using functions as inputs so there are standard ways to handle this across languages. We don't need a custom trait if we stick to {{Any}} as the hashing function input. What else could be a problem? > HashingTF should use MurmurHash3 > > > Key: SPARK-10574 > URL: https://issues.apache.org/jira/browse/SPARK-10574 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.5.0 >Reporter: Simeon Simeonov >Assignee: Yanbo Liang > Labels: HashingTF, hashing, mllib > > {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are > two significant problems with this. 
> First, per the [Scala > documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for > {{hashCode}}, the implementation is platform specific. This means that > feature vectors created on one platform may be different than vectors created > on another platform. This can create significant problems when a model > trained offline is used in another environment for online prediction. The > problem is made harder by the fact that following a hashing transform > features lose human-tractable meaning and a problem such as this may be > extremely difficult to track down. > Second, the native Scala hashing function performs badly on longer strings, > exhibiting [200-500% higher collision > rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for > example, >
[jira] [Commented] (SPARK-10574) HashingTF should use MurmurHash3
[ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248627#comment-15248627 ] Simeon Simeonov commented on SPARK-10574: - [~josephkb] I agree that it would be an improvement. The issue I see with the current patch is that it would be an incompatible API change in the future (specifying hashing functions as objects and not by name). If we make just this one change everything else can be handled with no API changes, e.g., seeds are just constructor parameters or closure variables available to the hashing function and collision detection is just decoration. That's my practical argument related to MLlib. Beyond that, there are multiple arguments related to the usability, testability and maintainability of the Spark codebase, which has high code change velocity from a large number of committers, which contributes to a high issue rate. The simplest way to battle this is one design decision at a time. The PR hard-codes what is essentially a strategy pattern in the implementation of an object. It conflates responsibilities. It introduces branching, which makes testing and documentation more complicated. If hashing functions are externalized, they could be trivially tested. If {{HashingTF}} took a {{Function1[Any, Int]}} as input it could also be tested much more simply with any function. The APIs become simpler to document because they do one thing. Etc. Perhaps I'm only seeing the benefits of externalizing the hashing strategy and missing the complexity in what I propose? We have ample examples of Spark APIs using functions as inputs so there are standard ways to handle this across languages. We don't need a custom trait if we stick to {{Any}} as the hashing function input. What else could be a problem? 
> HashingTF should use MurmurHash3 > > > Key: SPARK-10574 > URL: https://issues.apache.org/jira/browse/SPARK-10574 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.5.0 >Reporter: Simeon Simeonov >Assignee: Yanbo Liang > Labels: HashingTF, hashing, mllib > > {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are > two significant problems with this. > First, per the [Scala > documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for > {{hashCode}}, the implementation is platform specific. This means that > feature vectors created on one platform may be different than vectors created > on another platform. This can create significant problems when a model > trained offline is used in another environment for online prediction. The > problem is made harder by the fact that following a hashing transform > features lose human-tractable meaning and a problem such as this may be > extremely difficult to track down. > Second, the native Scala hashing function performs badly on longer strings, > exhibiting [200-500% higher collision > rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for > example, > [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$] > which is also included in the standard Scala libraries and is the hashing > choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If > Spark users apply {{HashingTF}} only to very short, dictionary-like strings > the hashing function choice will not be a big problem but why have an > implementation in MLlib with this limitation when there is a better > implementation readily available in the standard Scala library? > Switching to MurmurHash3 solves both problems. If there is agreement that > this is a good change, I can prepare a PR. > Note that changing the hash function would mean that models saved with a > previous version would have to be re-trained. 
This introduces a problem > that's orthogonal to breaking changes in APIs: breaking changes related to > artifacts, e.g., a saved model, produced by a previous version. Is there a > policy or best practice currently in effect about this? If not, perhaps we > should come up with a few simple rules about how we communicate these in > release notes, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
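The proposal in the comment above, passing the hashing function in rather than hard-coding it, is an ordinary strategy-pattern refactor. A pure-Python sketch of the shape (this is not Spark's API; `zlib.crc32` merely stands in for MurmurHash3 as a deterministic, platform-stable hash):

```python
import zlib

class HashingTF:
    """Toy term-frequency hasher. The hash function is injected, so each
    hashing strategy can be swapped and unit-tested in isolation."""

    def __init__(self, num_features, hash_fn):
        self.num_features = num_features
        self.hash_fn = hash_fn

    def transform(self, terms):
        vec = [0] * self.num_features
        for term in terms:
            # Bucket each term by its hash, modulo the feature dimension.
            vec[self.hash_fn(term) % self.num_features] += 1
        return vec

crc_hash = lambda term: zlib.crc32(term.encode("utf-8"))
tf = HashingTF(num_features=8, hash_fn=crc_hash)
print(sum(tf.transform(["spark", "spark", "streaming"])))  # 3
```

Because the strategy is a plain function, a test can inject a trivial hash (e.g. `lambda t: 0`) and assert on bucket placement directly, which is the testability argument made in the comment.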
[jira] [Updated] (SPARK-14736) Deadlock in registering applications while the Master is in the RECOVERING mode
[ https://issues.apache.org/jira/browse/SPARK-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] niranda perera updated SPARK-14736: --- Affects Version/s: 1.5.0 1.6.0 > Deadlock in registering applications while the Master is in the RECOVERING > mode > --- > > Key: SPARK-14736 > URL: https://issues.apache.org/jira/browse/SPARK-14736 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.5.0, 1.6.0 > Environment: unix, Spark cluster with a custom > StandaloneRecoveryModeFactory and a custom PersistenceEngine >Reporter: niranda perera >Priority: Critical > > I have encountered the following issue in the standalone recovery mode. > Let's say there was an application A running in the cluster. Due to some > issue, the entire cluster, together with the application A goes down. > Then later on, cluster comes back online, and the master then goes into the > 'recovering' mode, because it sees some apps, workers and drivers have > already been in the cluster from Persistence Engine. While in the recovery > process, the application comes back online, but now it would have a different > ID, let's say B. > But then, as per the master, application registration logic, this application > B will NOT be added to the 'waitingApps' with the message ""Attempted to > re-register application at same address". [1] > private def registerApplication(app: ApplicationInfo): Unit = { > val appAddress = app.driver.address > if (addressToApp.contains(appAddress)) { > logInfo("Attempted to re-register application at same address: " + > appAddress) > return > } > The problem here is, master is trying to recover application A, which is not > in there anymore. Therefore after the recovery process, app A will be > dropped. However app A's successor, app B was also omitted from the > 'waitingApps' list because it had the same address as App A previously. > This creates a deadlock in the cluster, app A nor app B is available in the > cluster. 
> When the master is in the RECOVERING mode, shouldn't it add all the > registering apps to a list first, and then after the recovery is completed > (once the unsuccessful recoveries are removed), deploy the apps which are new? > This would sort this deadlock IMO? > [1] > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14736) Deadlock in registering applications while the Master is in the RECOVERING mode
niranda perera created SPARK-14736: -- Summary: Deadlock in registering applications while the Master is in the RECOVERING mode Key: SPARK-14736 URL: https://issues.apache.org/jira/browse/SPARK-14736 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Environment: unix, Spark cluster with a custom StandaloneRecoveryModeFactory and a custom PersistenceEngine Reporter: niranda perera Priority: Critical I have encountered the following issue in the standalone recovery mode. Let's say there was an application A running in the cluster. Due to some issue, the entire cluster, together with application A, goes down. Then later on, the cluster comes back online, and the master goes into 'recovering' mode, because it sees that some apps, workers and drivers were already in the cluster according to the Persistence Engine. While in the recovery process, the application comes back online, but now it has a different ID, let's say B. But then, per the master's application registration logic, this application B will NOT be added to 'waitingApps', with the message "Attempted to re-register application at same address". [1] private def registerApplication(app: ApplicationInfo): Unit = { val appAddress = app.driver.address if (addressToApp.contains(appAddress)) { logInfo("Attempted to re-register application at same address: " + appAddress) return } The problem here is that the master is trying to recover application A, which is not there anymore. Therefore, after the recovery process, app A will be dropped. However, app A's successor, app B, was also omitted from the 'waitingApps' list because it had the same address as app A previously. This creates a deadlock in the cluster: neither app A nor app B is available in the cluster. When the master is in the RECOVERING mode, shouldn't it add all the registering apps to a list first, and then after the recovery is completed (once the unsuccessful recoveries are removed), deploy the apps which are new? 
This would sort out this deadlock, IMO. [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834
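The reporter's suggestion, buffer registrations that arrive during recovery and reconcile them once recovery completes, can be sketched in a few lines. This is a pure illustration of the idea, not the structure of Master.scala:

```python
class ToyMaster:
    """Buffers app registrations that arrive while RECOVERING, then replays
    them after stale recovered apps have been dropped."""

    def __init__(self):
        self.state = "RECOVERING"
        self.address_to_app = {}   # address -> app id (recovered, possibly stale)
        self.pending = []          # registrations seen during recovery
        self.waiting_apps = []

    def register(self, app_id, address):
        if self.state == "RECOVERING":
            self.pending.append((app_id, address))  # defer the duplicate check
            return
        if address in self.address_to_app:
            return  # a genuine re-registration at the same address
        self.address_to_app[address] = app_id
        self.waiting_apps.append(app_id)

    def complete_recovery(self, dead_app_ids):
        # Drop recovered apps that never responded (like app A) ...
        self.address_to_app = {a: i for a, i in self.address_to_app.items()
                               if i not in dead_app_ids}
        self.state = "ALIVE"
        # ... then replay deferred registrations: app B is no longer shadowed.
        for app_id, address in self.pending:
            self.register(app_id, address)
        self.pending = []

m = ToyMaster()
m.address_to_app = {"host:7077": "app-A"}  # recovered from the persistence engine
m.register("app-B", "host:7077")           # arrives mid-recovery, same address
m.complete_recovery(dead_app_ids={"app-A"})
print(m.waiting_apps)  # ['app-B']
```

With the current Master code, app B's registration would be rejected during recovery and app A would then be dropped, leaving neither app in the cluster; deferring the check until recovery finishes avoids that.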
[jira] [Created] (SPARK-14735) PySpark HashingTF hashAlgorithm param + docs
Joseph K. Bradley created SPARK-14735: - Summary: PySpark HashingTF hashAlgorithm param + docs Key: SPARK-14735 URL: https://issues.apache.org/jira/browse/SPARK-14735 Project: Spark Issue Type: Sub-task Components: ML, MLlib, PySpark Reporter: Joseph K. Bradley Add hashAlgorithm param to HashingTF in PySpark, and update docs to indicate the default algorithm is MurmurHash3.
[jira] [Updated] (SPARK-10574) HashingTF should use MurmurHash3
[ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10574: -- Shepherd: Joseph K. Bradley > HashingTF should use MurmurHash3 > > > Key: SPARK-10574 > URL: https://issues.apache.org/jira/browse/SPARK-10574 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.5.0 >Reporter: Simeon Simeonov >Assignee: Yanbo Liang > Labels: HashingTF, hashing, mllib > > {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are > two significant problems with this. > First, per the [Scala > documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for > {{hashCode}}, the implementation is platform specific. This means that > feature vectors created on one platform may be different than vectors created > on another platform. This can create significant problems when a model > trained offline is used in another environment for online prediction. The > problem is made harder by the fact that following a hashing transform > features lose human-tractable meaning and a problem such as this may be > extremely difficult to track down. > Second, the native Scala hashing function performs badly on longer strings, > exhibiting [200-500% higher collision > rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for > example, > [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$] > which is also included in the standard Scala libraries and is the hashing > choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If > Spark users apply {{HashingTF}} only to very short, dictionary-like strings > the hashing function choice will not be a big problem but why have an > implementation in MLlib with this limitation when there is a better > implementation readily available in the standard Scala library? > Switching to MurmurHash3 solves both problems. 
If there is agreement that > this is a good change, I can prepare a PR. > Note that changing the hash function would mean that models saved with a > previous version would have to be re-trained. This introduces a problem > that's orthogonal to breaking changes in APIs: breaking changes related to > artifacts, e.g., a saved model, produced by a previous version. Is there a > policy or best practice currently in effect about this? If not, perhaps we > should come up with a few simple rules about how we communicate these in > release notes, etc.
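For illustration, the two hashing choices discussed above can be contrasted with only the standard Scala library; `numFeatures` and the seed value here are arbitrary choices for this sketch, not asserted Spark defaults:

```scala
import scala.util.hashing.MurmurHash3

object HashingSketch {
  val numFeatures = 1 << 18 // arbitrary table size for this sketch

  // Maps a raw hash to a non-negative bucket index.
  private def index(hash: Int): Int =
    ((hash % numFeatures) + numFeatures) % numFeatures

  // Scala's native ## delegates to hashCode; for arbitrary types its
  // value is not guaranteed to be stable across platforms.
  def nativeIndex(term: Any): Int = index(term.##)

  // MurmurHash3.stringHash from the standard library is deterministic
  // across platforms for a fixed seed.
  def murmurIndex(term: String, seed: Int = 42): Int =
    index(MurmurHash3.stringHash(term, seed))
}
```

Because `murmurIndex` depends only on the string's bytes and the seed, a model trained offline and a prediction service on another platform will agree on feature indices, which is the first problem described above.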
[jira] [Commented] (SPARK-14692) Error While Setting the path for R front end
[ https://issues.apache.org/jira/browse/SPARK-14692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248557#comment-15248557 ] Felix Cheung commented on SPARK-14692: -- is Sys.getenv("SPARK_HOME") valid? > Error While Setting the path for R front end > > > Key: SPARK-14692 > URL: https://issues.apache.org/jira/browse/SPARK-14692 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.1 > Environment: Mac OSX >Reporter: Niranjan Molkeri` > > Trying to set the environment path for SparkR in RStudio and getting this bug: > > .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) > > library(SparkR) > Error in library(SparkR) : there is no package called ‘SparkR’ > > sc <- sparkR.init(master="local") > Error: could not find function "sparkR.init" > In the directory to which it points, there is a directory called SparkR. I > don't know how to proceed with this.
[jira] [Updated] (SPARK-13448) Document MLlib behavior changes in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-13448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-13448: -- Description: This JIRA keeps a list of MLlib behavior changes in Spark 2.0. So we can remember to add them to the migration guide / release notes. * SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 to 1e-6. * SPARK-7780: Intercept will not be regularized if users train a binary classification model with L1/L2 Updater by LogisticRegressionWithLBFGS, because it calls the ML LogisticRegression implementation. Meanwhile, if users train without regularization, training with or without feature scaling will return the same solution at the same convergence rate (because they run the same code route); this behavior is different from the old API. * SPARK-12363: Bug fix for PowerIterationClustering which will likely change results * SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by default, if checkpointing is being used. * SPARK-12153: Word2Vec now respects sentence boundaries. Previously, it did not handle them correctly. * SPARK-10574: HashingTF uses MurmurHash3 by default in both spark.ml and spark.mllib was: This JIRA keeps a list of MLlib behavior changes in Spark 2.0. So we can remember to add them to the migration guide / release notes. * SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 to 1e-6. * SPARK-7780: Intercept will not be regularized if users train a binary classification model with L1/L2 Updater by LogisticRegressionWithLBFGS, because it calls the ML LogisticRegression implementation. Meanwhile, if users train without regularization, training with or without feature scaling will return the same solution at the same convergence rate (because they run the same code route); this behavior is different from the old API.
* SPARK-12363: Bug fix for PowerIterationClustering which will likely change results * SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by default, if checkpointing is being used. * SPARK-12153: Word2Vec now respects sentence boundaries. Previously, it did not handle them correctly. * SPARK-10574: HashingTF uses MurmurHash3 by default > Document MLlib behavior changes in Spark 2.0 > > > Key: SPARK-13448 > URL: https://issues.apache.org/jira/browse/SPARK-13448 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > This JIRA keeps a list of MLlib behavior changes in Spark 2.0. So we can > remember to add them to the migration guide / release notes. > * SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 > to 1e-6. > * SPARK-7780: Intercept will not be regularized if users train a binary > classification model with L1/L2 Updater by LogisticRegressionWithLBFGS, > because it calls the ML LogisticRegression implementation. Meanwhile, if users train > without regularization, training with or without feature scaling will return > the same solution at the same convergence rate (because they run the same code > route); this behavior is different from the old API. > * SPARK-12363: Bug fix for PowerIterationClustering which will likely change > results > * SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by > default, if checkpointing is being used. > * SPARK-12153: Word2Vec now respects sentence boundaries. Previously, it did > not handle them correctly. > * SPARK-10574: HashingTF uses MurmurHash3 by default in both spark.ml and > spark.mllib
[jira] [Updated] (SPARK-13448) Document MLlib behavior changes in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-13448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-13448: -- Description: This JIRA keeps a list of MLlib behavior changes in Spark 2.0. So we can remember to add them to the migration guide / release notes. * SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 to 1e-6. * SPARK-7780: Intercept will not be regularized if users train a binary classification model with L1/L2 Updater by LogisticRegressionWithLBFGS, because it calls the ML LogisticRegression implementation. Meanwhile, if users train without regularization, training with or without feature scaling will return the same solution at the same convergence rate (because they run the same code route); this behavior is different from the old API. * SPARK-12363: Bug fix for PowerIterationClustering which will likely change results * SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by default, if checkpointing is being used. * SPARK-12153: Word2Vec now respects sentence boundaries. Previously, it did not handle them correctly. * SPARK-10574: HashingTF uses MurmurHash3 by default was: This JIRA keeps a list of MLlib behavior changes in Spark 2.0. So we can remember to add them to the migration guide / release notes. * SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 to 1e-6. * SPARK-7780: Intercept will not be regularized if users train a binary classification model with L1/L2 Updater by LogisticRegressionWithLBFGS, because it calls the ML LogisticRegression implementation. Meanwhile, if users train without regularization, training with or without feature scaling will return the same solution at the same convergence rate (because they run the same code route); this behavior is different from the old API.
* SPARK-12363: Bug fix for PowerIterationClustering which will likely change results * SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by default, if checkpointing is being used. * SPARK-12153: Word2Vec now respects sentence boundaries. Previously, it did not handle them correctly. > Document MLlib behavior changes in Spark 2.0 > > > Key: SPARK-13448 > URL: https://issues.apache.org/jira/browse/SPARK-13448 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > This JIRA keeps a list of MLlib behavior changes in Spark 2.0. So we can > remember to add them to the migration guide / release notes. > * SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 > to 1e-6. > * SPARK-7780: Intercept will not be regularized if users train a binary > classification model with L1/L2 Updater by LogisticRegressionWithLBFGS, > because it calls the ML LogisticRegression implementation. Meanwhile, if users train > without regularization, training with or without feature scaling will return > the same solution at the same convergence rate (because they run the same code > route); this behavior is different from the old API. > * SPARK-12363: Bug fix for PowerIterationClustering which will likely change > results > * SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by > default, if checkpointing is being used. > * SPARK-12153: Word2Vec now respects sentence boundaries. Previously, it did > not handle them correctly. > * SPARK-10574: HashingTF uses MurmurHash3 by default
[jira] [Commented] (SPARK-10574) HashingTF should use MurmurHash3
[ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248514#comment-15248514 ] Joseph K. Bradley commented on SPARK-10574: --- Copying comments from [~simeons] from the PR: {quote} When the "hashing trick" is used in practice, it is important to do things such as monitor, manage or randomize collisions. If there are problems, it is not uncommon to vary the hashing function. All this suggests that a hashing function should be treated as an object with a simple interface, perhaps as simple as Function1[Any, Int]. Collision monitoring can then be performed with a decorator with an accumulator. Collision management would be performed by varying the seed or adding salt. Collision randomization would be performed by varying the seed/salt with each run and/or running multiple models in production which are identical except for the different seed/salt used. The hashing trick is very important in ML and quite... tricky... to get working well for complex, high-dimensional spaces, which Spark is perfect for. An implementation that does not treat the hashing function as a first-class object would substantially hinder MLlib's capabilities in practice. {quote} --> This initial PR should be a big improvement, even if we just use MurmurHash3 without varied seeds/salts as you're suggesting. This also seems acceptable for now since it's what scikit-learn does. But later PRs could add further improvements. > HashingTF should use MurmurHash3 > > > Key: SPARK-10574 > URL: https://issues.apache.org/jira/browse/SPARK-10574 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.5.0 >Reporter: Simeon Simeonov >Assignee: Yanbo Liang > Labels: HashingTF, hashing, mllib > > {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are > two significant problems with this.
> First, per the [Scala > documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for > {{hashCode}}, the implementation is platform specific. This means that > feature vectors created on one platform may be different than vectors created > on another platform. This can create significant problems when a model > trained offline is used in another environment for online prediction. The > problem is made harder by the fact that following a hashing transform > features lose human-tractable meaning and a problem such as this may be > extremely difficult to track down. > Second, the native Scala hashing function performs badly on longer strings, > exhibiting [200-500% higher collision > rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for > example, > [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$] > which is also included in the standard Scala libraries and is the hashing > choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If > Spark users apply {{HashingTF}} only to very short, dictionary-like strings > the hashing function choice will not be a big problem but why have an > implementation in MLlib with this limitation when there is a better > implementation readily available in the standard Scala library? > Switching to MurmurHash3 solves both problems. If there is agreement that > this is a good change, I can prepare a PR. > Note that changing the hash function would mean that models saved with a > previous version would have to be re-trained. This introduces a problem > that's orthogonal to breaking changes in APIs: breaking changes related to > artifacts, e.g., a saved model, produced by a previous version. Is there a > policy or best practice currently in effect about this? If not, perhaps we > should come up with a few simple rules about how we communicate these in > release notes, etc. 
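The "hashing function as a first-class object" idea quoted above (a Function1 wrapped by a collision-monitoring decorator) might look like the following sketch. The names `SeededHasher` and `CollisionMonitor` are illustrative, not Spark or scikit-learn API; the "accumulator" is modeled here as a plain in-process map rather than a Spark Accumulator:

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// A hashing function as a first-class object: seed and table size are
// explicit, so collision management (varying the seed) is just
// constructing a different instance.
class SeededHasher(seed: Int, numFeatures: Int) extends (String => Int) {
  def apply(term: String): Int =
    ((MurmurHash3.stringHash(term, seed) % numFeatures) + numFeatures) % numFeatures
}

// Decorator for collision monitoring: records which distinct terms
// landed in each bucket during a pass over the data.
class CollisionMonitor(underlying: String => Int) extends (String => Int) {
  private val buckets = mutable.Map.empty[Int, mutable.Set[String]]

  def apply(term: String): Int = {
    val idx = underlying(term)
    buckets.getOrElseUpdate(idx, mutable.Set.empty[String]) += term
    idx
  }

  // Number of distinct terms beyond the first in each occupied bucket.
  def collisions: Int = buckets.valuesIterator.map(_.size - 1).sum
}
```

Because the decorator is itself a `String => Int`, it can be dropped in wherever the plain hasher is used, and collision randomization amounts to wrapping `new SeededHasher(freshSeed, n)` on each run.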